Performance Analysis of Thread Mappings with a Holistic View of the Hardware Resources
Wei Wang, Tanima Dey, Jason Mars, Lingjia Tang, Jack W. Davidson, and Mary Lou Soffa
Department of Computer Science
University of Virginia, Charlottesville, VA 22904
Email: {wwang,td8h,jom5x,lt8f,jwd,soffa}@virginia.edu
Abstract: With the shift to chip multiprocessors, managing shared resources has become a critical issue in realizing their full potential. Previous research has shown that thread mapping is a powerful tool for resource management. However, the difficulty of simultaneously managing multiple hardware resources and the varying nature of the workloads have impeded the efficiency of thread mapping algorithms. To overcome the difficulties of simultaneously managing multiple resources with thread mapping, the interaction between various microarchitectural resources and thread characteristics must be well understood.
This paper presents an in-depth analysis of PARSEC benchmarks running under different thread mappings to investigate the interaction of various thread mappings with microarchitectural resources, including L1 I/D-caches, I/D TLBs, L2 caches, hardware prefetchers, off-chip memory interconnects, branch predictors, memory disambiguation units and the cores. For each resource, the analysis provides guidelines for how to improve its utilization when mapping threads with different characteristics. We also analyze how the relative importance of the resources varies depending on the workloads. Our experiments show that when only memory resources are considered, thread mapping improves an application’s performance by as much as 14% over the default Linux scheduler. In contrast, when both memory and processor resources are considered, the mapping algorithm achieves performance improvements of as much as 28%. Additionally, we demonstrate that thread mapping should consider L2 caches, prefetchers and off-chip memory interconnects as one resource, and we present a new metric called L2-misses-memory-latency-product (L2MP) for evaluating their aggregated performance impact.
I. INTRODUCTION
Compared to traditional uniprocessors, chip multiprocessors (CMPs) greatly improve system throughput by offering computational resources that allow multiple threads to execute in parallel. To realize the full potential of these powerful platforms, efficiently managing the resources that are shared by these simultaneously executing threads has become a critical issue.
In this paper, we focus on managing CMP shared resources through thread mapping. Previous research has shown that thread mapping is a powerful tool for managing resources [6, 8, 17, 22, 26]. However, despite the intensive and extensive research on this topic, properly mapping threads to achieve the optimal performance for an arbitrary workload is still an open question. Finding the optimal thread mapping is extremely difficult because one must consider all relevant resources, and the interaction between these many resources is workload dependent. Previous research has shown that L2 caches, the front-side bus and prefetchers have to be considered when managing memory hierarchy resources [29]. However, it remains unclear whether there are additional resources that should be considered, and how to holistically improve their utilization based on the workload characteristics.
To holistically manage multiple resources with thread mapping, there are three major challenges.
1) The first challenge is to identify the key resources that need to be considered by thread mapping algorithms. Neglecting the key resources would result in suboptimal performance.
2) The second challenge is to determine how to map threads to improve the utilization of each key resource. The best thread mapping also depends on the thread run-time characteristics. For each key resource, we need to identify the related thread run-time characteristics and determine how to map threads when they exhibit these characteristics.
3) The third challenge is to handle situations where no thread mapping can improve the utilization of all key resources. Under such circumstances, thread mapping algorithms must prioritize the resources and focus on improving the utilization of resources that can provide the maximum benefit.
Previous research on thread mapping focused on improving the utilization of the resources within the memory hierarchy [29] or focused only on individual resources [6, 17, 26]. Although the proposed approaches are successful in improving the utilization of these resources, the best application performance is not always guaranteed [28]. Moreover, most previous work has been done using single-threaded workloads, while emerging workloads increasingly include multi-threaded programs. Multi-threaded workloads have different run-time characteristics and thus require different mapping strategies.
As a first step towards overcoming the challenges of holistically managing multiple resources, we provide an in-depth performance analysis of all possible thread mappings for a set of workloads created using applications from the multi-threaded PARSEC benchmark suite [4]. While other work has looked at the memory hierarchy, in this work we take a holistic look at both the memory resources and processor resources (e.g., branch predictors, memory disambiguation unit, etc.), and evaluate their relative importance. In this analysis, we identify the key resources that are responsible for performance differences.
related threads with these characteristics to improve the utilization of the key resources. Additionally, to help make trade-off decisions, we analyze the relative importance of the key resources for each workload, and investigate the reason for prioritizing some resources over the others. We observe that, by focusing on multiple resources, proper thread mapping can improve an application’s performance by up to 28% over the current Linux scheduler, while consideration of only memory resources provides an improvement of only 14%.
Specifically, the contributions of this paper include:
1) An in-depth analysis that identifies the key hardware resources that must be considered by thread mapping algorithms, as well as the less important resources that do not need to be considered. Unlike previous work that considered only shared memory resources for mapping single-threaded applications, our paper demonstrates that for multi-threaded applications, thread mapping has to consider more resources and thread characteristics for better performance.
2) An analysis of how to improve each key resource’s utilization with thread mapping when managing threads with different run-time characteristics. To the best of our knowledge, this analysis is the first that investigates the characteristics of multi-threaded workloads and their implications for managing both memory and processor resources with thread mapping.
3) An analysis showing that L2 caches, prefetchers and memory interconnections should be considered as one resource because of their complex interactions. We also propose a new metric, L2-misses-memory-latency-product (L2MP), to measure their aggregated performance impact.
4) An analysis that identifies the ranking of the key resources for each workload, and the reason for the ranking.
The remainder of this paper is organized as follows: Section II provides an overview of hardware resources and the thread characteristics considered in our analysis. Section III identifies the key resources for thread mapping algorithms. Section IV analyzes how to improve the utilization of individual resources via thread mapping. Section V discusses using the resource rankings to simultaneously manage multiple resources. Section VI summarizes the performance results. Section VII discusses related work and Section VIII concludes the paper.
II. PERFORMANCE ANALYSIS OVERVIEW
To address the challenges mentioned above, we perform a comprehensive analysis of how an application’s performance is affected when threads with various characteristics share multiple hardware resources. This section gives an overview of the resources, the metrics, the run-time characteristics, and the thread mappings that are covered in this analysis.
A. Hardware Resources
We address the resources that are commonly available on current CMP processors. Fig. 1 gives a schematic view of the resources provided by an Intel quad-core processor. The resources we consider can be classified into two categories: the memory hierarchy resources and the processor resources.
Memory hierarchy resources include L1 instruction caches (I-cache), L1 data caches (D-cache), instruction and data translation look-aside buffers (I/D TLB), L2 caches, hardware prefetchers and off-chip memory interconnect. We use an Intel Core 2 processor, which has two prefetch mechanisms, Data Prefetch Logic (DPL) and L2 Streaming Prefetch [16]. DPL fetches a stream of instructions and data from memory if a stride memory access pattern is detected. L2 Streaming Prefetch brings two adjacent cache lines into an L2 cache.
Processor resources include the cores and components that require training to function, such as branch predictors and memory disambiguation units. In this paper, we use the term Resource Cores when we discuss the core as a resource.
B. Metrics
The number of cache misses, the amount of memory transactions and memory access latency are used to evaluate the utilization of the memory hierarchy resources. The number of mispredictions and stalled CPU cycles due to these mispredictions are used to evaluate the training-based processor resources.
Processor utilization is used to evaluate the utilization of Resource Cores. Note that this processor utilization is viewed from the OS perspective. For example, suppose there is one thread that runs solely on a core. Due to I/O operations or synchronizations, this thread spends half of its execution time in the OS waiting queue. Then the core that executes this thread has a processor utilization of only 50%. In this paper, the processor utilization refers to the overall processor utilization of all cores in the system.
We use two metrics to evaluate the performance of the applications: the number of CPU cycles consumed and the execution time. Memory resources and the training-based processor resources impact the total cycles consumed by an application. Accordingly, we use executed CPU cycles to estimate the performance impact of these resources. The execution time, on the other hand, is affected by both processor utilization and the executed cycles. We use execution time when evaluating the performance impact of all the resources. The relation of execution time, executed cycles and processor utilization is described by equation (1).
$$\mathit{ExecTime} = \frac{\mathit{ExecCycles}}{\mathit{Num} \times \mathit{Proc} \times \mathit{Frequency}} \tag{1}$$
Num and Frequency refer to the number of cores and the processor frequency, respectively, and Proc is the processor utilization. Suppose there is an application with four threads running on a 1 GHz quad-core processor. Each of the four threads requires 500 million CPU cycles to execute, and each thread has 50% processor utilization due to I/O operations and synchronizations. Then the execution time of this application is (500M cycles × 4 threads) / (4 cores × 50% × 1 GHz) = 1 second.
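The worked example for equation (1) can be checked with a few lines of Python. This is only a sketch: the function and parameter names are illustrative, and it takes the Proc term to be the overall processor utilization expressed as a fraction, as the example in the text implies.

```python
def exec_time_seconds(total_cycles, num_cores, utilization, frequency_hz):
    """Equation (1): executed cycles divided by the cycles per second
    that the cores actually deliver at the given utilization."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be a fraction in (0, 1]")
    return total_cycles / (num_cores * utilization * frequency_hz)


# Worked example from the text: four threads of 500 million cycles each
# on a 1 GHz quad-core processor at 50% processor utilization.
total_cycles = 4 * 500_000_000
t = exec_time_seconds(total_cycles, num_cores=4, utilization=0.5,
                      frequency_hz=1e9)
print(f"execution time: {t} s")
```

Halving the utilization in this sketch doubles the execution time for the same cycle count, which is why the paper reports execution time, and not cycles alone, when Resource Cores are involved.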
All of the metrics mentioned in this section can be acquired from performance monitoring units (PMUs) [16]. TABLE I gives the names of the PMUs we used in our analysis.
We compute memory access latency from PMUs using equation (2), proposed by Eranian [13]. Essentially, in equation (2), memory latency is computed by dividing the total cycles of all memory read transactions by the number of memory reads.
$$\mathit{MemLatency} = \frac{\mathrm{BUS\_REQUEST\_OUTSTANDING}}{\mathrm{BUS\_TRANS\_BRD} - \mathrm{BUS\_TRANS\_IFETCH}} \tag{2}$$
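A minimal sketch of the latency computation in equation (2), assuming the denominator is the number of read transactions minus the instruction-fetch transactions. The parameter names and the sample counter values below are hypothetical and are not measurements from the paper.

```python
def memory_latency_cycles(outstanding_cycles, read_transactions,
                          ifetch_transactions):
    """Average cycles per memory data read: cycles with reads
    outstanding divided by the number of data read transactions."""
    data_reads = read_transactions - ifetch_transactions
    if data_reads <= 0:
        raise ValueError("no data read transactions were counted")
    return outstanding_cycles / data_reads


# Hypothetical counter readings, for illustration only.
latency = memory_latency_cycles(outstanding_cycles=1_200_000,
                                read_transactions=10_000,
                                ifetch_transactions=2_000)
print(f"average memory latency: {latency} cycles")
```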
C. Thread Characteristics
Thread characteristics include the properties of a single thread and the interactions among multiple threads.
For single thread properties, we consider a thread’s cache demand, memory bandwidth demand and I/O frequency. Additionally, to describe how threads utilize prefetchers, we introduce three metrics: prefetcher effectiveness, prefetcher excessiveness and prefetch/memory fraction. We define a thread’s prefetcher effectiveness as the percentage of the L2 cache misses that are reduced when the prefetchers are turned on compared to when they are turned off. We define a thread’s prefetcher excessiveness as the percentage of the additional cache lines that are brought into the L2 cache when the prefetchers are turned on compared to when they are turned off. Prefetch/memory fraction is defined as the fraction of prefetching transactions in the total memory transactions when prefetchers are on. Prefetcher effectiveness measures how much the application benefits from the prefetcher; prefetcher excessiveness measures how much extra pressure is put on memory bandwidth due to prefetching activity; and prefetch/memory fraction illustrates the overall impact of prefetchers on memory bandwidth.
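The three prefetcher metrics can be sketched as follows. The text does not spell out the denominators, so this sketch assumes both percentages are taken relative to the prefetchers-off measurement; all names and input numbers are illustrative.

```python
def prefetcher_effectiveness(l2_misses_off, l2_misses_on):
    """Percentage of L2 misses removed by turning the prefetchers on.
    Negative when the prefetchers cause additional misses."""
    return 100.0 * (l2_misses_off - l2_misses_on) / l2_misses_off


def prefetcher_excessiveness(l2_lines_in_off, l2_lines_in_on):
    """Percentage of additional cache lines brought into the L2 cache
    when the prefetchers are on."""
    return 100.0 * (l2_lines_in_on - l2_lines_in_off) / l2_lines_in_off


def prefetch_memory_fraction(prefetch_transactions, total_transactions):
    """Share of memory transactions issued by the prefetchers."""
    return prefetch_transactions / total_transactions


# Made-up numbers for illustration.
print(prefetcher_effectiveness(1000, 600))
print(prefetcher_effectiveness(1000, 1100))
print(prefetcher_excessiveness(1000, 1300))
print(prefetch_memory_fraction(300, 1200))
```

Under these assumed definitions, a thread whose prefetchers increase its L2 misses gets a negative effectiveness value.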
For multiple thread interactions, we consider data sharing, instruction sharing, and the frequency of synchronization operations. These interactions usually happen among threads from the same application. We call such threads sibling threads.
[TABLE II: thread mappings; table body not recoverable] a1 and a2 are threads from application 1 and application 2, respectively. Each application has four threads. Threads from the same application are assumed to have similar characteristics. Note that we do not consider SMT here, so when two threads are pinned to one core, they share that core in a time-multiplexing manner.
D. Thread Mappings
In our experiments, we examined all possible thread mappings when running two multi-threaded applications, each with four threads. Therefore, there are eight threads in total. Using more threads than cores allows thorough evaluation of the resources, including L1 caches, TLBs, branch predictors, memory disambiguation units and Resource Cores. Moreover, because real application threads have synchronizations and I/O operations, they cannot use all of the cores allocated to them. Thus, using more threads than cores can improve the overall processor utilization.
To guide our analysis, we choose four thread mappings that cover all resource sharing configurations (either sibling threads share a resource or non-sibling threads share a resource). All other thread mappings could be viewed as hybrid versions of these four mappings. TABLE II shows the four thread mappings on a quad-core processor running Linux. Except for the OS mapping, all mappings are done by statically pinning threads to cores using the processor affinity system call. How these four thread mappings use resources is described below.
OS Mapping (OSMap): This thread mapping is determined by the Linux scheduler. The OS tries to evenly spread the threads across the cores in the system to ensure fair processor time allocation and low processor idle time. Under this mapping, as long as there is an available core and a runnable thread, that thread is mapped to run on that core. As a result, any thread can run on any core and share the resources associated with these cores. The OSMap is used as the baseline for performance comparison.
Isolation-mapping (IsoMap): Under IsoMap, sibling threads are mapped to run on the two cores that share one L2 cache. In other words, they are isolated on that L2 cache. L1 caches, TLBs, L2 caches, hardware prefetchers and cores are shared by siblings.
Interleaving-mapping (IntMap): Under IntMap, threads from different applications are mapped to the cores in an interleaved fashion. L1 caches, TLBs and cores are still only shared by sibling threads. L2 caches and prefetchers are shared by threads from different applications.
Spreading-mapping (SprMap): Under SprMap, the four threads of each application are evenly spread over the four cores. As a result, every core executes two threads, which come from two different applications. L1 caches, TLBs, L2 caches and cores are shared by threads from two applications.
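The three static mappings can be written down as thread-to-core tables. The sketch below is an illustration, not the authors' tooling: it assumes cores 0 and 1 share one L2 cache and cores 2 and 3 share the other (the actual core numbering is machine-specific), and the thread names are hypothetical. On Linux, Python exposes the affinity system call as `os.sched_setaffinity`, shown here in a helper that is defined but not invoked.

```python
import os

# Assumption (not stated in the text): cores 0 and 1 share one L2 cache,
# cores 2 and 3 share the other. The real numbering is machine-specific.
L2_GROUPS = [(0, 1), (2, 3)]


def build_mappings(threads_per_app=4):
    """Return {mapping name: {thread name: core}} for two applications."""
    iso, inter, spread = {}, {}, {}
    for i in range(threads_per_app):
        t1, t2 = f"a1_t{i}", f"a2_t{i}"
        # IsoMap: each application stays on the two cores of one L2 cache.
        iso[t1] = L2_GROUPS[0][i % 2]
        iso[t2] = L2_GROUPS[1][i % 2]
        # IntMap: each core runs only siblings, but every L2 cache
        # serves one core of each application.
        inter[t1] = L2_GROUPS[i % 2][0]
        inter[t2] = L2_GROUPS[i % 2][1]
        # SprMap: thread i of both applications lands on core i.
        spread[t1] = i % 4
        spread[t2] = i % 4
    return {"IsoMap": iso, "IntMap": inter, "SprMap": spread}


def apps_sharing(mapping, groups):
    """For each group of cores, list which applications run there."""
    result = []
    for group in groups:
        apps = {name.split("_")[0]
                for name, core in mapping.items() if core in group}
        result.append(sorted(apps))
    return result


def pin_calling_thread(core):
    """Pin the caller to one core; os.sched_setaffinity is Linux-only."""
    os.sched_setaffinity(0, {core})


mappings = build_mappings()
per_core = [(c,) for c in range(4)]
for name, mapping in mappings.items():
    print(name, "per core:", apps_sharing(mapping, per_core),
          "per L2:", apps_sharing(mapping, L2_GROUPS))
```

Printing which applications meet on each core and on each L2 cache makes the sharing pattern of every mapping explicit, which is the property the four mappings were chosen to cover.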
Note that, although we assume that sibling threads are identical here, some PARSEC benchmarks have sibling threads with different characteristics. However, we observe that the difference between the sibling threads is negligible. Therefore, thread mappings that only differ in the placement of sibling threads usually have similar performance.
III. KEY RESOURCES IDENTIFICATION
This section identifies the key resources for thread mapping algorithms. A resource should satisfy two criteria to be considered a key resource for thread mapping algorithms:
1) The utilization of this resource varies considerably among different thread mappings.
2) The utilization variations of this resource caused by thread mapping result in considerable variations in an application’s performance.
Criterion one can easily be determined directly using PMUs. However, the second criterion requires two approaches. Although experimenting on a real machine provides a more accurate understanding of thread mappings, the ability to precisely account for each resource’s performance impact is limited by the types of PMUs available in the hardware. For example, for branch predictors, there are PMUs that count the number of mispredictions, as well as the number of cycles stalled due to these mispredictions. However, for the L1 D-cache, there are only PMUs that give the number of L1 D-cache misses. There is no PMU that tells the number of cycles spent on L1 D-cache misses. Therefore, for different resources, different approaches have to be taken:
1) Direct Approach: For resources that have PMUs to measure their performance impacts, we use the readings from these PMUs directly.
2) Indirect Approach: For L1 D-caches, L2 caches and off-chip memory interconnects, there are no PMUs to directly measure their performance impact. For these resources, we first verify with PMUs that the performance variations across mappings are caused by memory stalls. Then we compare the performance of the thread mappings. If the application’s performance is improved in one mapping, and only one resource’s utilization is improved in this mapping, then we can conclude that it is this resource that causes the performance improvement.
A. Experimental Design
To find the key resources, we performed experiments on a real CMP machine. Here we introduce the experimental design. We use the PARSEC benchmark suite version 2.1 (with the native input set) to create our workloads because these benchmarks have a large variety of thread characteristics. TABLE III gives the run-time characteristics of PARSEC benchmarks.
In TABLE III, data sharing, working set and synchronization operations are collected with a simulator by the PARSEC authors [4]. The amount of data sharing (high or low) refers to the percentage (high or low) of the cache lines that are shared among sibling threads. The working set here is an architectural concept which means the total size of memory touched by a benchmark. We use working set size to estimate the cache demand of a benchmark. Synchronization operations measure the total number of locks, barriers and conditions executed by a benchmark. All other characteristics are collected on an Intel Q9550 processor. The I/O time is collected by instrumenting the I/O functions. The bandwidth requirement of a benchmark is acquired by dividing the total amount of memory accessed by the execution time of the benchmark. The total amount of memory accessed equals the total number of memory transactions times the size of each transaction, which is 64 bytes. Prefetcher effectiveness, prefetcher excessiveness and prefetch/memory fractions are computed following their definitions in Section II-C. The negative values of prefetcher effectiveness for swaptions and x264 suggest that these two benchmarks experience more L2 cache misses when hardware prefetchers are turned on.
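The bandwidth requirement described above reduces to one line of arithmetic. The sketch below uses the 64-byte transaction size given in the text; the function name and the sample inputs are hypothetical.

```python
TRANSACTION_BYTES = 64  # size of one memory transaction, per the text


def bandwidth_bytes_per_second(num_transactions, exec_time_s):
    """Total memory accessed divided by the benchmark's execution time."""
    if exec_time_s <= 0:
        raise ValueError("execution time must be positive")
    return num_transactions * TRANSACTION_BYTES / exec_time_s


# Hypothetical run: 50 million memory transactions over 10 seconds.
bw = bandwidth_bytes_per_second(50_000_000, 10.0)
print(f"{bw / 1e6} MB/s")
```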
Each workload consists of a pair of benchmarks. Therefore, we can compare the mappings where sibling threads share the resources with the mappings where non-sibling threads share the resources. We use nine PARSEC benchmarks and thus there are 36 pairs (workloads) in total. Four benchmarks, ferret, dedup, freqmine and raytrace, are not used in our analysis due to compilation errors, configuration errors or execution errors. For each mapping, each workload is executed until the longest benchmark has finished three runs. Shorter benchmarks are restarted if the longest benchmark has not finished. The average of the results of the first three runs is presented. The variation of the results for the same mapping and workload is very small. For IsoMap, IntMap, and SprMap, the variation is less than 2%. For OSMap, the variation is higher, usually between 2% and 4%. However, since we only use OSMap as a baseline, the higher variation would not affect our conclusions.
All experiments are conducted on a platform that has an Intel quad-core Q9550 processor. Each core of this processor has one 32 KB L1 I-cache and one 32 KB L1 D-cache. Every two cores share one 6 MB L2 cache (Fig. 1). This platform has 2 GB of memory and runs Linux kernel 2.6.25. Readings from PMUs are collected with PerfMon2 [12].
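The workload count in this subsection follows from pairing nine benchmarks without repetition. The sketch below enumerates the pairs; the benchmark list is taken from the PARSEC 2.1 program list with the four excluded programs removed, which is an inference and not a list printed in this section.

```python
from itertools import combinations

# PARSEC 2.1 programs minus ferret, dedup, freqmine and raytrace.
BENCHMARKS = [
    "blackscholes", "bodytrack", "canneal", "facesim", "fluidanimate",
    "streamcluster", "swaptions", "vips", "x264",
]

# Every unordered pair of distinct benchmarks is one workload.
workloads = list(combinations(BENCHMARKS, 2))

# Number of workloads each benchmark takes part in.
per_benchmark = {b: sum(b in w for w in workloads) for b in BENCHMARKS}

print(len(workloads), "workloads")
print(per_benchmark)
```

Each benchmark appears in one workload with every other benchmark, so per-benchmark results are averages over the workloads that contain it.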
Fig. 2: Average L1 D-cache misses each benchmark experiences under the four mappings. Normalized to OSMap. The result of swaptions is not shown here (nor in the following memory resources figures) because it has very few memory accesses.
Fig. 3: [...] Comparing four mappings does not change the conclusion.
Fig. 4: Comparison of the performance of streamcluster running with swaptions under IntMap and SprMap. Lower bar is better.
Fig. 5: Average processor utilization of each benchmark running under the four mappings. Normalized to OSMap.
B. L1 D-Cache
We first evaluate the importance of the L1 D-cache. Fig. 2 shows the normalized average L1 D-cache misses of each benchmark under the four mappings. For each PARSEC benchmark B, there are eight workloads (or pairs) that contain benchmark B. For each mapping, we run the eight workloads, and read the L1 D-cache misses of B. Then we compute the average of the L1 D-cache misses of B for each mapping, and report the results in Fig. 2. We repeat the same process for all benchmarks and all four mappings. As Fig. 2 shows, L1 D-cache misses vary from 2% to 14%, depending on the thread mapping.
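The averaging procedure behind Fig. 2 can be sketched as below. The sketch assumes the per-mapping average is computed first and then normalized to the OSMap average; the text does not say which order the authors used. The miss counts are invented for illustration and are not data from the paper.

```python
from statistics import mean


def normalized_average(misses_by_mapping, baseline="OSMap"):
    """Average a benchmark's misses over its workloads for each mapping,
    then express every average relative to the baseline mapping."""
    averages = {m: mean(values) for m, values in misses_by_mapping.items()}
    base = averages[baseline]
    return {m: avg / base for m, avg in averages.items()}


# Invented L1 D-cache miss counts of one benchmark B in its eight workloads.
misses_of_b = {
    "OSMap":  [100, 110, 90, 100, 105, 95, 100, 100],
    "IntMap": [88, 92, 90, 90, 91, 89, 90, 90],
    "SprMap": [110, 112, 108, 110, 111, 109, 110, 110],
}

normalized = normalized_average(misses_of_b)
print(normalized)
```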
We evaluated the L1 D-cache’s impact on performance with the indirect approach. Fig. 3 shows the CPU cycles and memory resource utilization of x264 running with streamcluster under IntMap and SprMap. Because no other resource’s utilization changed from one mapping to another, only memory resources are shown in the figure. Fig. 3 shows that although IntMap has more L2 misses and higher memory latency, its performance is still better than SprMap due to fewer L1 misses. Therefore, thread-mapping-induced variation of L1 D-cache misses can cause considerable performance variation.
In conclusion, L1 D-cache misses vary depending on thread mappings. Furthermore, this variation can cause considerable performance variation. Consequently, the L1 D-cache should be considered a key resource.
C. L2 Cache, Hardware Prefetchers and Off-chip Memory Interconnect
Previous research has demonstrated that thread mapping can significantly impact the utilization of L2 caches, hardware prefetchers and off-chip memory interconnect, and consequently impact application performance [6, 17, 22, 26, 29]. Thus, these resources should be considered as key resources. Results of our experiments corroborate this conclusion. However, unlike previous research, our experiment results show that these memory resources are better viewed as one resource by thread mapping algorithms, and we provide a new metric called L2MP for evaluating their aggregated impact. The detailed discussion on this subject can be found in Section V-A.
D. Branch Predictors
In our experiments, we observe that one thread mapping could have 15 times more branch mispredictions than another mapping, which suggests that branch mispredictions vary significantly depending on thread mappings.
We evaluate the performance impact of branch predictors with the direct approach. Fig. 4 shows the performance of streamcluster and swaptions running together. Streamcluster consumes 48% more CPU cycles under SprMap than under IntMap, and swaptions consumes 8% more CPU cycles under SprMap. Fig. 4 also shows that 99% of the increased CPU cycles are caused by branch mispredictions. Therefore, the variation of branch mispredictions can produce considerable application performance variation.
In conclusion, branch mispredictions vary depending on thread mappings. Furthermore, this variation can cause considerable performance variation. Consequently, branch predictors should be considered a key resource.
E. L1 I-cache, I/D TLBs and Memory Disambiguation Units
Different thread mappings have a great impact on the utilization of L1 I-caches, I/D TLBs and memory disambiguation units. One thread mapping can have more than ten times more misses/mispredictions from these resources than another mapping. Yet the absolute amount of time spent in serving these extra misses/mispredictions (acquired from PMUs directly) accounts for less than 2% (in most cases less than 0.5%) of the total execution time. Therefore, we conclude that these resources should receive low priority when mapping threads of the PARSEC benchmarks.