Object Detectors Emerge in Deep Scene CNNs
Published as a conference paper at ICLR 2015

OBJECT DETECTORS EMERGE IN DEEP SCENE CNNS

Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, Antonio Torralba
Computer Science and Artificial Intelligence Laboratory, MIT
{bolei,khosla,agata,oliva,torralba}@mit.edu

arXiv:1412.6856v2 [cs.CV] 15 Apr 2015

ABSTRACT

With the success of new computational architectures for visual processing, such as convolutional neural networks (CNNs), and access to image databases with millions of labeled examples (e.g., ImageNet, Places), the state of the art in computer vision is advancing rapidly. One important factor for continued progress is to understand the representations that are learned by the inner layers of these deep architectures. Here we show that object detectors emerge from training CNNs to perform scene classification. As scenes are composed of objects, the CNN for scene classification automatically discovers meaningful object detectors, representative of the learned scene categories. With object detectors emerging as a result of learning to recognize scenes, our work demonstrates that the same network can perform both scene recognition and object localization in a single forward-pass, without ever having been explicitly taught the notion of objects.

1 INTRODUCTION

Current deep neural networks achieve remarkable performance at a number of vision tasks, surpassing techniques based on hand-crafted features. However, while the structure of the representation in hand-crafted features is often clear and interpretable, in the case of deep networks it remains unclear what the nature of the learned representation is and why it works so well. A convolutional neural network (CNN) trained on ImageNet (Deng et al., 2009) significantly outperforms the best hand-crafted features on the ImageNet challenge (Russakovsky et al., 2014). But more surprisingly, the same network, when used as a generic feature extractor, is also very successful at other tasks like object detection on the PASCAL VOC dataset (Everingham et al., 2010).

A number of works have focused on understanding the representation learned by CNNs. The work by Zeiler & Fergus (2014) introduces a procedure to visualize what activates each unit. Recently, Yosinski et al. (2014) used transfer learning to measure how generic or specific the learned features are. Agrawal et al. (2014) and Szegedy et al. (2013) suggest that the CNN for ImageNet learns a distributed code for objects. All of these works use ImageNet, an object-centric dataset, as the training set.

When training a CNN to distinguish different object classes, it is unclear what the underlying representation should be. Objects have often been described using part-based representations, where parts can be shared across objects, forming a distributed code. However, what those parts should be is unclear. For instance, one would think that the meaningful parts of a face are the mouth, the two eyes, and the nose. However, those are simply functional parts, with words associated with them; the object parts that are important for visual recognition might be different from these semantic parts, making it difficult to evaluate how efficient a representation is. In fact, the strong internal configuration of objects makes the definition of what is a useful part poorly constrained: an algorithm can find different and arbitrary part configurations, all giving similar recognition performance.

Learning to classify scenes (i.e., classifying an image as being an office, a restaurant, a street, etc.) using the Places dataset (Zhou et al., 2014) gives the opportunity to study the internal representation learned by a CNN on a task other than object recognition. In the case of scenes, the representation is clearer. Scene categories are defined by the objects they contain and, to some extent, by the spatial configuration of those objects. For instance, the important parts of a bedroom are the bed, a side table, a lamp, a cabinet, as well as the walls, floor and ceiling. Objects therefore represent a distributed code for scenes (i.e., object classes are shared across different scene categories). Importantly, in scenes, the spatial configuration of objects, although compact, has a much larger degree of freedom. It is this loose spatial dependency that, we believe, makes scene representation different from most object classes (most object classes do not have a loose interaction between parts). In addition to objects, other feature regularities of scene categories allow other representations to emerge, such as textures (Renninger & Malik, 2004), GIST (Oliva & Torralba, 2006), bag-of-words (Lazebnik et al., 2006), part-based models (Pandey & Lazebnik, 2011), and ObjectBank (Li et al., 2010). While a CNN has enough flexibility to learn any of those representations, if meaningful objects emerge without supervision inside the inner layers of the CNN, there will be little ambiguity as to which type of representation these networks are learning.

The main contribution of this paper is to show that object detection emerges inside a CNN trained to recognize scenes, even more than when trained with ImageNet. This is surprising because our results demonstrate that reliable object detectors are found even though, unlike ImageNet, no supervision is provided for objects. Although object discovery with deep neural networks has been shown before in an unsupervised setting (Le, 2013), here we find that many more objects can be naturally discovered, in a supervised setting tuned to scene classification rather than object classification.
Importantly, the emergence of object detectors inside the CNN suggests that a single network can support recognition at several levels of abstraction (e.g., edges, texture, objects, and scenes) without needing multiple outputs or a collection of networks. Whereas other works have shown that one can detect objects by applying the network multiple times in different locations (Girshick et al., 2014), by focusing attention (Tang et al., 2014), or by doing segmentation (Grangier et al., 2009; Farabet et al., 2013), here we show that the same network can do both object localization and scene recognition in a single forward-pass. Another set of recent works (Oquab et al., 2014; Bergamo et al., 2014) demonstrates the ability of deep networks trained on object classification to do localization without bounding box supervision. However, unlike our work, these require object-level supervision, while we only use scenes.

2 IMAGENET-CNN AND PLACES-CNN

Convolutional neural networks have recently obtained astonishing performance on object classification (Krizhevsky et al., 2012) and scene classification (Zhou et al., 2014). The ImageNet-CNN from Jia (2013) is trained on 1.3 million images from 1000 object categories of ImageNet (ILSVRC 2012) and achieves a top-1 accuracy of 57.4%. With the same network architecture, Places-CNN is trained on 2.4 million images from 205 scene categories of the Places Database (Zhou et al., 2014), and achieves a top-1 accuracy of 50.0%. The network architecture used for both CNNs, as proposed in Krizhevsky et al. (2012), is summarized in Table 1 [1]. Both networks are trained from scratch using only the specified dataset.

Table 1: The parameters of the network architecture used for ImageNet-CNN and Places-CNN.

Layer:   conv1  pool1  conv2  pool2  conv3  conv4  conv5  pool5  fc6   fc7
Units:   96     96     256    256    384    384    256    256    4096  4096
Feature: 55×55  27×27  27×27  13×13  13×13  13×13  13×13  6×6    1     1

[1] We use "unit" to refer to neurons in the various layers and "features" to refer to their activations.

The deep features from Places-CNN tend to perform better on scene-related recognition tasks than the features from ImageNet-CNN. For example, compared to Places-CNN, which achieves 50.0% on scene classification, ImageNet-CNN combined with a linear SVM achieves only 40.8% on the same test set [2], illustrating the importance of having scene-centric data.

[2] A scene recognition demo of Places-CNN is available at http://places.csail.mit.edu/demo.html. The demo has a 77.3% top-5 recognition rate in the wild, estimated from 968 anonymous user responses.

To further highlight the difference in representations, we conduct a simple experiment to identify the types of images preferred at the different layers of each network: we create a set of 200k images with an approximately equal distribution of scene-centric and object-centric images [3], run them through both networks, and record the activations at each layer. For each layer, we obtain the top 100 images that produce the largest average activation (sum over all spatial locations for a given layer).

[3] 100k object-centric images from the test set of ImageNet LSVRC2012 and 108k scene-centric images from the SUN dataset (Xiao et al., 2014).
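As an illustration of this ranking step, the following is a minimal sketch, not the authors' original Caffe pipeline: it uses torchvision's AlexNet as a stand-in for the Table 1 architecture, a hypothetical eval_images/ directory in place of the 200k-image set, and forward hooks to record each layer's average activation, keeping the 100 highest-scoring images per layer.

    # Minimal sketch: rank images by per-layer average activation (assumed setup).
    import glob
    import heapq
    import torch
    from PIL import Image
    from torchvision import models, transforms

    model = models.alexnet(pretrained=True).eval()   # stand-in for ImageNet-CNN / Places-CNN
    layers = {"conv1": model.features[0], "conv3": model.features[6],
              "conv5": model.features[10], "fc7": model.classifier[4]}

    preprocess = transforms.Compose([
        transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])

    activations = {}
    def make_hook(name):
        def hook(module, inputs, output):
            # mean over all spatial locations and channels of this layer's response
            activations[name] = output.detach().mean().item()
        return hook

    for name, layer in layers.items():
        layer.register_forward_hook(make_hook(name))

    top100 = {name: [] for name in layers}                 # min-heaps of (activation, path)
    image_paths = sorted(glob.glob("eval_images/*.jpg"))   # hypothetical evaluation set

    with torch.no_grad():
        for path in image_paths:
            x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
            model(x)                                  # forward pass fills `activations`
            for name in layers:
                heapq.heappush(top100[name], (activations[name], path))
                if len(top100[name]) > 100:
                    heapq.heappop(top100[name])       # drop the smallest, keep the top 100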
Figure 1: Top 3 images producing the largest activation of units in each layer (pool1, pool2, conv3, conv4, pool5, fc7) of ImageNet-CNN (top) and Places-CNN (bottom).

Fig. 1 shows the top 3 images for each layer. We observe that the earlier layers such as pool1 and pool2 prefer similar images for both networks, while the later layers tend to be more specialized to the specific task of scene or object categorization. For layer pool2, 55% and 47% of the top-100 images belong to the ImageNet dataset for ImageNet-CNN and Places-CNN, respectively. Starting from layer conv4, we observe a significant difference in the number of top-100 images belonging to each dataset for each network. For fc7, we observe that 78% and 24% of the top-100 images belong to the ImageNet dataset for ImageNet-CNN and Places-CNN, respectively, illustrating a clear bias in each network. In the following sections, we further investigate the differences between these networks, focusing on better understanding the nature of the representation learned by Places-CNN when doing scene classification, in order to clarify part of the reason for its strong performance.

3 UNCOVERING THE CNN REPRESENTATION

The performance of scene recognition using Places-CNN is quite impressive given the difficulty of the task. In this section, our goal is to understand the nature of the representation that the network is learning.

3.1 SIMPLIFYING THE INPUT IMAGES

Simplifying images is a well-known strategy to test human recognition. For example, one can remove information from the image to test whether it is diagnostic of a particular object or scene (for a review, see Biederman (1995)). A similar procedure was also used by Tanaka (1993) to understand the receptive fields of complex cells in the inferior temporal cortex (IT). Inspired by these approaches, our idea is the following: given an image that is correctly classified by the network, we want to simplify this image such that it keeps as little visual information as possible while still having a high classification score for the same category. This simplified image (named the minimal image representation) will allow us to highlight the elements that lead to the high classification score. In order to do this, we manipulate images in the gradient space, as is typically done in computer graphics (Pérez et al., 2003). We investigate two different approaches, described below.

In the first approach, given an image, we create a segmentation of edges and regions and remove segments from the image iteratively. At each iteration we remove the segment that produces the smallest decrease of the correct classification score, and we do this until the image is incorrectly classified. At the end, we get a representation of the original image that contains, approximately, the minimal amount of information needed by the network to correctly recognize the scene category. In Fig. 2 we show some examples of these minimal image representations.
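A compact sketch of this greedy removal loop follows, under stated assumptions: segments is a list of boolean masks produced by any off-the-shelf segmentation of edges and regions, score(img, label) is the network's score for the correct scene category, predict(img) is its top-1 prediction, and segments are simply blanked out rather than removed in the gradient space as in the paper.

    # Sketch of the iterative segment-removal procedure (illustrative, not the authors' code).
    import numpy as np

    def minimal_image(img, label, segments, score, predict, fill=0):
        """Greedily remove the segment whose removal hurts the correct-class
        score the least, stopping just before the image is misclassified."""
        current = img.copy()
        remaining = list(segments)                  # boolean masks, one per segment
        while remaining:
            best_idx, best_img, best_score = None, None, -np.inf
            for i, mask in enumerate(remaining):
                candidate = current.copy()
                candidate[mask] = fill              # stand-in for gradient-space removal
                s = score(candidate, label)
                if s > best_score:
                    best_idx, best_img, best_score = i, candidate, s
            if predict(best_img) != label:          # next removal breaks the classification
                break
            current = best_img                      # accept the least-damaging removal
            remaining.pop(best_idx)
        return current                              # approximate minimal image representation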
Notice that objects seem to contribute important information for the network to recognize the scene. For instance, in the case of bedrooms these minimal image representations usually contain the region of the bed, and in the art gallery category, the regions of the paintings on the walls.

Based on the previous results, we hypothesized that for the Places-CNN, some objects were crucial for recognizing scenes. This inspired our second approach: we generate the minimal image representations using the fully annotated image set of the SUN Database (Xiao et al., 2014) (see Section 4.1 for details on this dataset) instead of performing automatic segmentation. We follow the same procedure as in the first approach, using the ground-truth object segments provided in the database. This led to some interesting observations: for bedrooms, the minimal representations retained the bed in 87% of the cases; other objects kept in bedrooms were the wall (28%) and the window (21%). For art gallery, the minimal image representations contained paintings (81%) and pictures (58%); in amusement parks, carousel (75%), ride (64%), and roller coaster (50%); in bookstore, bookcase (96%), books (68%), and shelves (67%). These results suggest that object detection is an important part of the representation built by the network to obtain discriminative information for scene classification.

Figure 2: Each pair of images shows the original image (left) and a simplified image (right) that gets classified by the Places-CNN as the same scene category as the original image. From top to bottom, the four rows show different scene categories: bedroom, auditorium, art gallery, and dining room.

Figure 3: The pipeline for estimating the RF of each unit. Each sliding-window stimulus contains a small randomized patch at a different spatial location. By comparing the activation response of the sliding-window stimuli with the activation response of the original image, we obtain a discrepancy map for each image. By summing up the calibrated discrepancy maps for the top-ranked images, we obtain the actual RF of that unit.

3.2 VISUALIZING THE RECEPTIVE FIELDS OF UNITS AND THEIR ACTIVATION PATTERNS

In this section, we investigate the shape and size of the receptive fields (RFs) of the various units in the CNNs. While theoretical RF sizes can be computed given the network architecture (Long et al., 2014), we are interested in the actual, or empirical, size of the RFs. We expect the empirical RFs to be better localized and more representative of the information they capture than the theoretical ones, allowing us to better understand what is learned by each unit of the CNN. Thus, we propose a data-driven approach to estimate the learned RF of each unit in each layer. It is simpler than the deconvolutional-network visualization of Zeiler & Fergus (2014).
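To make the pipeline in Figure 3 concrete, here is an illustrative occlusion-based sketch. The unit_activation(img) function (the response of the unit under study), the occluder size and stride, and the final threshold are assumptions, and the per-image calibration step is simplified to a plain sum of discrepancy maps.

    # Sketch of empirical receptive-field (RF) estimation via occlusion discrepancy maps.
    # Assumes H×W×3 uint8 images and a scalar-valued `unit_activation(img)`.
    import numpy as np

    def discrepancy_map(img, unit_activation, patch=11, stride=3, seed=0):
        """Slide a small randomized occluder over the image and record how much
        the unit's response drops at each location."""
        rng = np.random.default_rng(seed)
        h, w = img.shape[:2]
        base = unit_activation(img)                      # response to the original image
        dmap = np.zeros((h, w), dtype=np.float32)
        counts = np.zeros((h, w), dtype=np.float32)
        for y in range(0, h - patch + 1, stride):
            for x in range(0, w - patch + 1, stride):
                occluded = img.copy()
                occluded[y:y + patch, x:x + patch] = rng.integers(0, 256, (patch, patch, 3))
                drop = max(base - unit_activation(occluded), 0.0)
                dmap[y:y + patch, x:x + patch] += drop
                counts[y:y + patch, x:x + patch] += 1
        return dmap / np.maximum(counts, 1)

    def empirical_rf(top_images, unit_activation, threshold=0.5):
        """Sum the discrepancy maps of the unit's top-ranked images (the paper
        additionally calibrates/re-centers them) and threshold to get the RF."""
        total = sum(discrepancy_map(img, unit_activation) for img in top_images)
        return total > threshold * total.max()           # boolean RF mask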