Performance Analysis of Thread Mappings with a Holistic View of the Hardware Resources
Wei Wang, Tanima Dey, Jason Mars, Lingjia Tang, Jack W. Davidson, and Mary Lou Soffa
Department of Computer Science
University of Virginia, Charlottesville, VA 22904
Email: {wwang, td8h, jom5x, lt8f, jwd, soffa}@virginia.edu
Abstract—With the shift to chip multiprocessors, managing shared resources has become a critical issue in realizing their full potential. Previous research has shown that thread mapping is a powerful tool for resource management. However, the difficulty of simultaneously managing multiple hardware resources and the varying nature of the workloads have impeded the efficiency of thread mapping algorithms. To overcome the difficulties of simultaneously managing multiple resources with thread mapping, the interaction between various microarchitectural resources and thread characteristics must be well understood.
This paper presents an in-depth analysis of PARSEC benchmarks running under different thread mappings to investigate the interaction of various thread mappings with microarchitectural resources, including L1 I/D-caches, I/D TLBs, L2 caches, hardware prefetchers, off-chip memory interconnects, branch predictors, memory disambiguation units and the cores. For each resource, the analysis provides guidelines for how to improve its utilization when mapping threads with different characteristics. We also analyze how the relative importance of the resources varies depending on the workloads. Our experiments show that when only memory resources are considered, thread mapping improves an application's performance by as much as 14% over the default Linux scheduler. In contrast, when both memory and processor resources are considered, the mapping algorithm achieves performance improvements of as much as 28%. Additionally, we demonstrate that thread mapping should consider L2 caches, prefetchers and off-chip memory interconnects as one resource, and we present a new metric called L2-misses-memory-latency-product (L2MP) for evaluating their aggregated performance impact.
I. INTRODUCTION
Compared to traditional uniprocessors, chip multiprocessors (CMPs) greatly improve system throughput by offering computational resources that allow multiple threads to execute in parallel. To realize the full potential of these powerful platforms, efficiently managing the resources that are shared by these simultaneously executing threads has become a critical issue.
In this paper, we focus on managing CMP shared resources through thread mapping. Previous research has shown that thread mapping is a powerful tool for managing resources [6, 8, 17, 22, 26]. However, despite the intensive and extensive research on this topic, properly mapping threads to achieve the optimal performance for an arbitrary workload is still an open question. Finding the optimal thread mapping is extremely difficult because one must consider all relevant resources and the interaction between these many resources is workload dependent. Previous research has shown that L2 caches, front-side-bus and prefetchers have to be considered when managing memory hierarchy resources [29]. However, it remains unclear whether there are additional resources that should be considered, and how to holistically improve their utilization based on the workload characteristics.
To holistically manage multiple resources with thread mapping, there are three major challenges.
1) The first challenge is to identify the key resources that need to be considered by thread mapping algorithms. Neglecting the key resources would result in suboptimal performance.
2) The second challenge is to determine how to map threads to improve the utilization of each key resource. The best thread mapping also depends on the thread run-time characteristics. For each key resource, we need to identify the related thread run-time characteristics and determine how to map threads when they exhibit these characteristics.
3) The third challenge is to handle situations where no thread mapping can improve the utilization of all key resources. Under such circumstances, thread mapping algorithms must prioritize the resources and focus on improving the utilization of resources that can provide the maximum benefit.
Previous research on thread mapping focused on improving the utilization of the resources within the memory hierarchy [29] or focused only on individual resources [6, 17, 26]. Although the proposed approaches are successful in improving the utilization of these resources, the best application performance is not always guaranteed [28]. Moreover, most previous work has been done using single-threaded workloads, while emerging workloads increasingly include multi-threaded programs. Multi-threaded workloads have different run-time characteristics, and thus require different mapping strategies.
As a first step towards overcoming the challenges of holistically managing multiple resources, we provide an in-depth performance analysis of all possible thread mappings for a set of workloads created using applications from the multi-threaded PARSEC benchmark suite [4]. While other work has looked at the memory hierarchy, in this work we take a holistic look at both the memory resources and processor resources (e.g., branch predictors, memory disambiguation unit, etc.), and evaluate their relative importance. In this analysis, we identify the key resources that are responsible for performance differences.
[...] related threads with these characteristics to improve the utilization of the key resources. Additionally, to help make trade-off decisions, we analyze the relative importance of the key resources for each workload, and investigate the reason for prioritizing some resources over the others. We observe that, by focusing on multiple resources, proper thread mapping can improve an application's performance by up to 28% over the current Linux scheduler, while consideration of only memory resources provides an improvement of only 14%.
Specifically, the contributions of this paper include:
1) An in-depth analysis that identifies the key hardware resources that must be considered by thread mapping algorithms, as well as the less important resources that do not need to be considered. Unlike previous work that considered only shared memory resources for mapping single-threaded applications, our paper demonstrates that for multi-threaded applications, thread mapping has to consider more resources and thread characteristics for better performance.
2) An analysis of how to improve each key resource's utilization with thread mapping when managing threads with different run-time characteristics. To the best of our knowledge, this analysis is the first that investigates the characteristics of multi-threaded workloads and their implications for managing both memory and processor resources with thread mapping.
3) An analysis showing that L2 caches, prefetchers and memory interconnections should be considered as one resource because of their complex interactions. We also propose a new metric, L2-misses-memory-latency-product (L2MP), to measure their aggregated performance impact.
4) An analysis that identifies the ranking of the key resources for each workload, and the reason for the ranking.
The remainder of this paper is organized as follows: Section II provides an overview of hardware resources and the thread characteristics considered in our analysis. Section III identifies the key resources for thread mapping algorithms. Section IV analyzes how to improve the utilization of individual resources via thread mapping. Section V discusses using the resource rankings to simultaneously manage multiple resources. Section VI summarizes the performance results. Section VII discusses related work and Section VIII concludes the paper.
II. PERFORMANCE ANALYSIS OVERVIEW
To address the challenges mentioned above, we perform a comprehensive analysis of how an application's performance is affected when threads with various characteristics share multiple hardware resources. This section gives an overview of the resources, the metrics, the run-time characteristics, and the thread mappings that are covered in this analysis.
A. Hardware Resources
We address the resources that are commonly available on current CMP processors. Fig. 1 gives a schematic view of the resources provided by an Intel quad-core processor. The resources we consider can be classified into two categories: the memory hierarchy resources and the processor resources.
Memory hierarchy resources include L1 instruction caches (I-cache), L1 data caches (D-cache), instruction and data translation look-aside buffers (I/D TLB), L2 caches, hardware prefetchers and off-chip memory interconnect. We use an Intel Core 2 processor, which has two prefetch mechanisms, Data Prefetch Logic (DPL) and L2 Streaming Prefetch [16]. DPL fetches a stream of instructions and data from memory if a stride memory access pattern is detected. L2 Streaming Prefetch brings two adjacent cache lines into an L2 cache.
Processor resources include the cores and components that require training to function, such as branch predictors and memory disambiguation units. In this paper, we use the term Resource Cores when we discuss the core as a resource.
B. Metrics
The number of cache misses, the number of memory transactions and the memory access latency are used to evaluate the utilization of the memory hierarchy resources. The number of mispredictions and the stalled CPU cycles due to these mispredictions are used to evaluate the training-based processor resources.
Processor utilization is used to evaluate the utilization of Resource Cores. Note that this processor utilization is viewed from the OS perspective. For example, suppose there is one thread that runs solely on a core. Due to I/O operations or synchronizations, this thread spends half of its execution time in the OS waiting queue. Then the core that executes this thread has a processor utilization of only 50%. In this paper, the processor utilization refers to the overall processor utilization of all cores in the system.
We use two metrics to evaluate the performance of the applications: the number of CPU cycles consumed and the execution time. Memory resources and the training-based processor resources impact the total cycles consumed by an application. Accordingly, we use executed CPU cycles to estimate the performance impact of these resources. The execution time, on the other hand, is affected by both processor utilization and the executed cycles. We use execution time when evaluating the performance impact of all the resources. The relation of execution time, executed cycles and processor utilization is described by equation (1).
\[
\mathit{ExecTime} = \frac{\mathit{ExecCycles}}{\mathit{Num} \times \mathit{Proc} \times \mathit{Frequency}} \tag{1}
\]
Num and Frequency refer to the number of cores and the processor frequency, respectively, and Proc is the overall processor utilization. Suppose there is an application with four threads running on a quad-core processor at 1 GHz. Each of the four threads requires 500 million CPU cycles to execute, and each thread has 50% processor utilization due to I/O operations and synchronizations. Then the execution time of this application is (500M (cycles) × 4 (threads)) / (4 (cores) × 50% × 1 GHz) = 1 second.
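The worked example above can be checked with a few lines of Python. This is a minimal sketch of equation (1) only; the function and parameter names are illustrative and do not come from the paper.

```python
def exec_time_seconds(exec_cycles, num_cores, proc_utilization, frequency_hz):
    """Equation (1): ExecTime = ExecCycles / (Num x Proc x Frequency).

    proc_utilization is the overall processor utilization as a fraction in (0, 1].
    """
    if not 0.0 < proc_utilization <= 1.0:
        raise ValueError("proc_utilization must be in (0, 1]")
    return exec_cycles / (num_cores * proc_utilization * frequency_hz)


# Four threads, 500 million cycles each, on four 1 GHz cores at 50% utilization.
total_cycles = 4 * 500e6
print(exec_time_seconds(total_cycles, num_cores=4,
                        proc_utilization=0.5, frequency_hz=1e9))
```

With the numbers from the text, the call evaluates to 1 second, matching the example.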
All of the metrics mentioned in this section can be acquired from performance monitoring units (PMUs) [16]. TABLE I gives the names of the PMUs we used in our analysis.
We compute memory access latency from PMUs using equation (2), proposed by Eranian [13]. Essentially, in equation (2), memory latency is computed by dividing the total cycles of all memory read transactions by the number of memory reads.
\[
\mathit{MemLatency} = \frac{\mathrm{BUS\_REQUEST\_OUTSTANDING}}{\mathrm{BUS\_TRANS\_BRD} - \mathrm{BUS\_TRANS\_IFETCH}} \tag{2}
\]
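As a rough illustration of how equation (2) could be evaluated from raw counter readings, the sketch below divides the cycles spent on outstanding read transactions by the number of memory reads. The parameter names are hypothetical stand-ins for the numerator counter and the two denominator counters, and the sample values are made up.

```python
def memory_latency_cycles(outstanding_cycles, read_transactions, ifetch_transactions):
    """Average memory read latency in cycles, following the shape of equation (2).

    outstanding_cycles:   total cycles of all memory read transactions (numerator)
    read_transactions:    first denominator counter
    ifetch_transactions:  second denominator counter, subtracted from the first
    """
    memory_reads = read_transactions - ifetch_transactions
    if memory_reads <= 0:
        raise ValueError("no memory reads were counted")
    return outstanding_cycles / memory_reads


# Hypothetical counter readings, for illustration only.
print(memory_latency_cycles(3000, 25, 5))
```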
C. Thread Characteristics
Thread characteristics include the properties of a single thread and the interactions among multiple threads.
For single thread properties, we consider a thread's cache demand, memory bandwidth demand and I/O frequency. Additionally, to describe how threads utilize prefetchers, we introduce three metrics: prefetcher effectiveness, prefetcher excessiveness and prefetch/memory fraction. We define a thread's prefetcher effectiveness as the percentage of the L2 cache misses that are reduced when the prefetchers are turned on compared to when they are turned off. We define a thread's prefetcher excessiveness as the percentage of the additional cache lines that are brought into the L2 cache when the prefetchers are turned on compared to when they are turned off. Prefetch/memory fraction is defined as the fraction of prefetching transactions in the total memory transactions when prefetchers are on. Prefetcher effectiveness measures how much the application benefits from the prefetcher; prefetcher excessiveness measures how much extra pressure is put on memory bandwidth due to prefetching activity; and prefetch/memory fraction illustrates the overall impact of prefetchers on memory bandwidth.
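One plausible reading of the three prefetcher metrics is sketched below. The choice of the prefetchers-off measurement as the base of each percentage is an assumption drawn from the wording of the definitions, and the function names and sample numbers are illustrative.

```python
def prefetcher_effectiveness(l2_misses_off, l2_misses_on):
    """Percentage of L2 misses removed by turning the prefetchers on.

    Negative when the thread sees more L2 misses with prefetchers on.
    """
    return 100.0 * (l2_misses_off - l2_misses_on) / l2_misses_off


def prefetcher_excessiveness(l2_lines_off, l2_lines_on):
    """Percentage of additional cache lines brought into L2 with prefetchers on."""
    return 100.0 * (l2_lines_on - l2_lines_off) / l2_lines_off


def prefetch_memory_fraction(prefetch_transactions, total_transactions):
    """Fraction of all memory transactions that are prefetches (prefetchers on)."""
    return prefetch_transactions / total_transactions


# Made-up measurements for one thread.
print(prefetcher_effectiveness(1000, 600))
print(prefetcher_excessiveness(2000, 2500))
print(prefetch_memory_fraction(300, 1200))
```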
For multiple thread interactions, we consider data sharing, instruction sharing, and the frequency of synchronization operations. These interactions usually happen among threads from the same application. We call such threads sibling threads.
TABLE II: [...] a1 and a2 are threads from application 1 and application 2, respectively. Each application has four threads. Threads from the same application are assumed to have similar characteristics. Note that we do not consider SMT here, so when two threads are pinned to one core, they share that core in a time multiplexing manner.
D. Thread Mappings
In our experiments, we examined all possible thread mappings when running two multi-threaded applications each with four threads. Therefore, there are eight threads in total [...]. Using more threads than cores allows thorough evaluation of the resources, including L1 caches, TLBs, branch predictors, memory disambiguation units and Resource Cores. Moreover, because real application threads have synchronizations and I/O operations, they cannot use all of the cores allocated to them. Thus using more threads than cores can improve the overall processor utilization.
To guide our analysis, we choose four thread mappings that cover all resource sharing configurations (either sibling threads share a resource or non-sibling threads share a resource). All other thread mappings could be viewed as hybrid versions of these four mappings. TABLE II shows the four thread mappings on a quad-core processor running Linux. Except for the OS mapping, all mappings are done by statically pinning threads to cores using the processor affinity system call. How these four thread mappings use resources is described below.
OS Mapping (OSMap): This thread mapping is determined by the Linux scheduler. The OS tries to evenly spread the threads across the cores in the system to ensure fair processor time allocation and low processor idle time. Under this mapping, as long as there is an available core and a runnable thread, that thread is mapped to run on that core. As a result, any thread can run on any core and share the resources associated with these cores. The OSMap is used as the baseline for performance comparison.
Isolation-mapping (IsoMap): Under IsoMap, sibling threads are mapped to run on the two cores that share one L2 cache. In other words, they are isolated on that L2 cache. L1 caches, TLBs, L2 caches, hardware prefetchers and cores are shared by siblings.
Interleaving-mapping (IntMap): Under IntMap, threads from different applications are mapped to the cores in an interleaved fashion. L1 caches, TLBs and cores are still only shared by sibling threads. L2 caches and prefetchers are shared by threads from different applications.
Spreading-mapping (SprMap): Under SprMap, the four threads of each application are evenly spread on the four cores. As a result, every core executes two threads which come from two different applications. L1 caches, TLBs, L2 caches and cores are shared by threads from two applications.
Note that, although we assume that sibling threads are identical here, some PARSEC benchmarks have sibling threads with different characteristics. However, we observe that the difference between the sibling threads is negligible. Therefore, thread mappings that only differ in the placement of sibling threads usually have similar performance.
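The three static mappings can be written down as thread-to-core tables. The sketch below is one possible encoding, assuming that cores 0 and 1 share one L2 cache and cores 2 and 3 share the other; the core numbering, the thread names and the helper functions are all hypothetical, since the actual core enumeration depends on the machine.

```python
from collections import defaultdict

# Assumed topology: cores 0 and 1 share one L2 cache, cores 2 and 3 the other.
L2_DOMAINS = [(0, 1), (2, 3)]


def build_mappings():
    """Return {mapping name: {thread name: core id}} for two 4-thread applications."""
    a1 = [f"a1_t{i}" for i in range(4)]
    a2 = [f"a2_t{i}" for i in range(4)]
    plans = {
        # Siblings isolated on the two cores of one L2 cache.
        "IsoMap": ([0, 0, 1, 1], [2, 2, 3, 3]),
        # Each core runs siblings only; each L2 cache serves both applications.
        "IntMap": ([0, 0, 2, 2], [1, 1, 3, 3]),
        # Every core runs one thread from each application.
        "SprMap": ([0, 1, 2, 3], [0, 1, 2, 3]),
    }
    return {
        name: {**dict(zip(a1, cores1)), **dict(zip(a2, cores2))}
        for name, (cores1, cores2) in plans.items()
    }


def apps_per_core(mapping):
    """Which applications have at least one thread on each core."""
    result = defaultdict(set)
    for thread, core in mapping.items():
        result[core].add(thread.split("_")[0])
    return dict(result)


def apps_per_l2(mapping):
    """Which applications share each L2 cache under the assumed topology."""
    per_core = apps_per_core(mapping)
    return [set().union(*(per_core.get(core, set()) for core in domain))
            for domain in L2_DOMAINS]


for name, mapping in build_mappings().items():
    print(name, apps_per_core(mapping), apps_per_l2(mapping))
```

On Linux, a table like this could then be applied with a processor affinity call such as Python's `os.sched_setaffinity`; OSMap needs no table because placement is left to the scheduler.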
III. KEY RESOURCES IDENTIFICATION
This section identifies the key resources for thread mapping algorithms. A resource should satisfy two criteria to be considered as a key resource for thread mapping algorithms:
1) The utilization of this resource varies considerably among different thread mappings.
2) Thread-mapping-caused utilization variations of this resource result in considerable variations in an application's performance.
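The two criteria can be expressed as a simple check over per-mapping measurements. The sketch below is only illustrative: the text says "considerably" without giving numbers, so both thresholds are hypothetical parameters, and the sample values are made up.

```python
def relative_spread(values):
    """(max - min) / min over a collection of positive measurements."""
    values = list(values)
    lowest, highest = min(values), max(values)
    if lowest <= 0:
        raise ValueError("measurements must be positive")
    return (highest - lowest) / lowest


def is_key_resource(events_by_mapping, cycles_by_mapping,
                    min_event_spread=0.10, min_cycle_spread=0.05):
    """Apply both criteria; the two thresholds are assumptions, not from the paper.

    events_by_mapping: resource events (e.g. misses) per thread mapping
    cycles_by_mapping: application CPU cycles per thread mapping
    """
    varies = relative_spread(events_by_mapping.values()) >= min_event_spread
    matters = relative_spread(cycles_by_mapping.values()) >= min_cycle_spread
    return varies and matters


# Made-up data: large event variation that also moves performance.
print(is_key_resource({"IntMap": 1.0e6, "SprMap": 15.0e6},
                      {"IntMap": 100.0, "SprMap": 148.0}))
# Made-up data: large event variation with almost no performance effect.
print(is_key_resource({"IntMap": 1.0e6, "SprMap": 12.0e6},
                      {"IntMap": 100.0, "SprMap": 101.0}))
```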
CriteriononecaneasilybedetermineddirectlyusingPMUs.However,thesecondcriterionrequirestwoapproaches.Al-thoughexperimentingonarealmachineprovidesmoreaccurateunderstandingofthreadmappings,theabilitytopreciselyaccounteachresource’sperformanceimpactislimitedbythetypesofPMUsavailableinthehardware.Forexample,forbranchpredictors,therearePMUsthatcountthenumberofmispredictions,aswellasthenumberofcyclesstalledduetothesemispredictions.However,forL1D-cache,thereareonlyPMUsthatgivethenumberofL1D-cachemisses.ThereisnoPMUthattellsthenumberofcyclesspentonL1D-cachemisses.Therefore,fordifferentresources,differentapproacheshavetobetaken:
1)DirectApproachForresourcesthathavePMUstomeasuretheirperformanceimpacts,weusethereadingfromthesePMUsdirectly.
2)IndirectApproachForL1D-caches,L2cachesandoff-chipmemoryinterconnects,therearenoPMUstodirectlymeasuretheirperformanceimpact.Fortheseresources,we?rstverifywithPMUsthattheperformancevariationsacrossmappingsarecausedbymemorystalls.Thenwecomparetheperformanceofthethreadmappings.Iftheapplication’sper-formanceisimprovedinonemapping,andonlyoneresource’sutilizationisimprovedinthismapping,thenwecanconcludethatitisthisresourcethatcausestheperformanceimprovement.A.ExperimentalDesign
To find the key resources, we performed experiments on a real CMP machine. Here we introduce the experimental design. We use PARSEC benchmark suite version 2.1 (with the native input set) to create our workloads because these benchmarks have a large variety of thread characteristics. TABLE III gives the run-time characteristics of PARSEC benchmarks.
In Table III, data sharing, working set and synchronization operations are collected with a simulator by the PARSEC authors [4]. The amount of data sharing (high or low) refers to the percentage (high or low) of the cache lines that are shared among sibling threads. The working set here is an architectural concept which means the total size of memory touched by a benchmark. We use working set size to estimate the cache demand of a benchmark. Synchronization operations measures the total number of locks, barriers and conditions executed by a benchmark. All other characteristics are collected on an Intel Q9550 processor. The I/O time is collected by instrumenting the I/O functions. The bandwidth requirement of a benchmark is acquired by dividing the total amount of memory accessed by the execution time of the benchmark. The total amount of memory accessed equals the total number of memory transactions times the size of each transaction, which is 64 bytes. Prefetcher effectiveness, prefetcher excessiveness and prefetch/memory fractions are computed following their definitions in Section II-C. The negative values of prefetcher effectiveness for swaptions and x264 suggest that these two benchmarks experience more L2 cache misses when hardware prefetchers are turned on.
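The bandwidth computation described above is simple arithmetic, sketched here for concreteness. Only the 64-byte transaction size comes from the text; the function name and the sample numbers are illustrative.

```python
TRANSACTION_BYTES = 64  # size of each memory transaction, as stated in the text


def bandwidth_bytes_per_second(memory_transactions, exec_time_seconds):
    """Total memory accessed (transactions x 64 bytes) divided by execution time."""
    if exec_time_seconds <= 0:
        raise ValueError("execution time must be positive")
    return memory_transactions * TRANSACTION_BYTES / exec_time_seconds


# Made-up run: one million transactions over two seconds.
print(bandwidth_bytes_per_second(1_000_000, 2.0))
```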
Each workload consists of a pair of benchmarks. Therefore, we can compare the mappings where sibling threads share the resources with the mappings where non-sibling threads share the resources. We use nine PARSEC benchmarks and thus there are 36 pairs (workloads) in total. Four benchmarks, ferret, dedup, freqmine and raytrace, are not used in our analysis due to compilation errors, configuration errors or execution errors. For each mapping, each workload is executed until the longest benchmark has finished three runs. Shorter benchmarks are restarted if the longest benchmark has not finished. The average of the results of the first three runs is presented. The variation of the results for the same mapping and workload is very small. For IsoMap, IntMap, and SprMap, the variation is less than 2%. For OSMap, the variation is higher, usually between 2% and 4%. However, since we only use OSMap as a baseline, the higher variation would not affect our conclusions.
All experiments are conducted on a platform that has an Intel quad-core Q9550 processor. Each core of this processor has one 32KB L1 I-cache and one 32KB L1 D-cache. Every two cores share one 6MB L2 cache (Fig. 1). This platform has 2GB memory and runs Linux kernel 2.6.25. Readings from PMUs are collected with PerfMon2 [12].
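The workload count follows from pairing nine benchmarks without repetition. The short sketch below enumerates the pairs; the benchmark list is the PARSEC 2.1 suite minus the four excluded programs, and the variable and function names are illustrative.

```python
from itertools import combinations

# PARSEC 2.1 benchmarks remaining after excluding ferret, dedup, freqmine, raytrace.
BENCHMARKS = [
    "blackscholes", "bodytrack", "canneal", "facesim", "fluidanimate",
    "streamcluster", "swaptions", "vips", "x264",
]

# Every unordered pair of distinct benchmarks is one workload.
WORKLOADS = list(combinations(BENCHMARKS, 2))


def workloads_containing(benchmark):
    """All workloads (pairs) in which the given benchmark takes part."""
    return [pair for pair in WORKLOADS if benchmark in pair]


print(len(WORKLOADS))                     # number of workloads
print(len(workloads_containing("x264")))  # workloads per benchmark
```

Nine benchmarks give 9 × 8 / 2 = 36 workloads, and each benchmark appears in eight of them.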
Fig. 2: Average L1 D-cache misses each benchmark experiences under the four mappings. Normalized to OSMap. Result of swaptions is not shown here (nor in the following memory resources figures) because it has very few memory accesses.
Fig. 3: [...] Comparing four mappings does not change the conclusion.
Fig. 4: Comparison of the performance of streamcluster running with swaptions under IntMap and SprMap. Lower bar is better.
Fig. 5: Average processor utilization of each benchmark running under the four mappings. Normalized to OSMap.
B. L1 D-Cache
We first evaluate the importance of L1 D-cache. Fig. 2 shows the normalized average L1 D-cache misses of each benchmark under the four mappings. For each PARSEC benchmark B, there are eight workloads (or pairs) that contain benchmark B. For each mapping, we run the eight workloads, and read the L1 D-cache misses of B. Then we compute the average of the L1 D-cache misses of B for each mapping, and report the results in Fig. 2. We repeat the same process for all benchmarks and all four mappings. As Fig. 2 shows, L1 D-cache misses vary from 2% to 14%, depending on the thread mapping.
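The averaging procedure for one benchmark could look like the sketch below. It assumes the per-mapping averages are taken first and then divided by the OSMap average; the text does not spell out that order, so this is one plausible interpretation, and the sample miss counts are made up.

```python
from statistics import mean


def normalized_average_misses(misses_by_mapping, baseline="OSMap"):
    """Average a benchmark's misses over its workloads, then normalize to a baseline.

    misses_by_mapping: {mapping name: [miss count in each workload containing B]}
    """
    averages = {name: mean(counts) for name, counts in misses_by_mapping.items()}
    base = averages[baseline]
    return {name: avg / base for name, avg in averages.items()}


# Made-up L1 D-cache miss counts for one benchmark in two of its workloads.
sample = {"OSMap": [100, 300], "IsoMap": [90, 270]}
print(normalized_average_misses(sample))
```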
We evaluated L1 D-cache's impact on performance with the indirect approach. Fig. 3 shows the CPU cycles and memory resource utilization of x264 running with streamcluster under IntMap and SprMap. Because no other resource's utilization has changed from one mapping to another, only memory resources are shown in the figure. Fig. 3 shows that although IntMap has more L2 misses and higher memory latency, its performance is still better than SprMap due to fewer L1 misses. Therefore, thread mapping induced variation of L1 D-cache misses can cause considerable performance variation.
In conclusion, L1 D-cache misses vary depending on thread mappings. Furthermore, this variation can cause considerable performance variation. Consequently, L1 D-Cache should be considered as a key resource.
C. L2 Cache, Hardware Prefetchers and Off-chip Memory Interconnect
Previous research has demonstrated that thread mapping can significantly impact the utilization of L2 caches, hardware prefetchers and off-chip memory interconnect, and consequently impact application performance [6, 17, 22, 26, 29]. Thus, these resources should be considered as key resources. Results of our experiments corroborate this conclusion. However, unlike previous research, our experiment results show that these memory resources are better viewed as one resource by thread mapping algorithms, and we provide a new metric called L2MP for evaluating their aggregated impact. The detailed discussion on this subject can be found in Section V-A.
D. Branch Predictors
In our experiments, we observe that one thread mapping could have 15 times more branch mispredictions than another mapping, which suggests that branch predictors' mispredictions vary significantly depending on thread mappings.
We evaluate the performance impact of branch predictors with the direct approach. Fig. 4 shows the performance of streamcluster and swaptions running together. Streamcluster consumes 48% more CPU cycles under SprMap than under IntMap, and swaptions consumes 8% more CPU cycles under SprMap. Fig. 4 also shows that 99% of the increased CPU cycles are caused by branch mispredictions. Therefore, the variation of branch mispredictions can produce considerable application performance variation.
In conclusion, branch mispredictions vary depending on thread mappings. Furthermore, this variation can cause considerable performance variation. Consequently, branch predictors should be considered as a key resource.
E. L1 I-cache, I/D TLBs and Memory Disambiguation Units
Different thread mappings have a great impact on the utilization of L1 I-caches, I/D TLBs and memory disambiguation units. One thread mapping can have more than ten times more misses/mispredictions from these resources than another mapping. Yet the absolute amount of time spent in serving these extra misses/mispredictions (acquired from PMUs directly) accounts for less than 2% (in most cases less than 0.5%) of the total execution time. Therefore, we conclude that these resources should receive low priority when mapping threads of the PARSEC benchmarks.