지금까지 로봇에게 어떤 작업을 가르칠 때에는 각각의 cost function이 필요했고 이 함수를 고안하는 것은 복잡한 과정이었습니다.
그러나 사람은 타인이 하는 작업을 보고 손쉽게 어떻게 그 물건을 다루어야 할 지를 배웁니다. 이를 로봇에게도 적용시키기 위한 과정에 대한 논문입니다. 






Machine learning can allow robots to acquire complex skills, such as grasping and opening doors. However, learning these skills requires us to manually program reward functions that the robots then attempt to optimize. In contrast, people can understand the goal of a task just from watching someone else do it, or simply by being told what the goal is. We can do this because we draw on our own prior knowledge about the world: when we see someone cut an apple, we understand that the goal is to produce two slices, regardless of what type of apple it is, or what kind of tool is used to cut it. Similarly, if we are told to pick up the apple, we understand which object we are to grab because we can ground the word “apple” in the environment: we know what it means. 

These are semantic concepts: salient events like producing two slices, and object categories denoted by words such as “apple.” Can we teach robots to understand semantic concepts, to get them to follow simple commands specified through categorical labels or user-provided examples? In this post, we discuss some of our recent work on robotic learning that combines experience that is autonomously gathered by the robot, which is plentiful but lacks human-provided labels, with human-labeled data that allows a robot to understand semantics. We will describe how robots can use their experience to understand the salient events in a human-provided demonstration, mimic human movements despite the differences between human robot bodies, and understand semantic categories, like “toy” and “pen”, to pick up objects based on user commands.

Understanding human demonstrations with deep visual features
In the first set of experiments, which appear in our paper Unsupervised Perceptual Rewards for Imitation Learning, our is aim is to enable a robot to understand a task, such as opening a door, from seeing only a small number of unlabeled human demonstrations. By analyzing these demonstrations, the robot must understand what is the semantically salient event that constitutes task success, and then use reinforcement learning to perform it.
Examples of human demonstrations (left) and the corresponding robotic imitation (right).
Unsupervised learning on very small datasets is one of the most challenging scenarios in machine learning. To make this feasible, we use deep visual features from a large network trained for image recognition on ImageNet. Such features are known to be sensitive to semantic concepts, while maintaining invariance to nuisance variables such as appearance and lighting. We use these features to interpret user-provided demonstrations, and show that it is indeed possible to learn reward functions in an unsupervised fashion from a few demonstrations and without retraining.
Example of reward functions learned solely from observation for the door opening tasks. Rewards progressively increase from zero to the maximum reward as a task is completed.
After learning a reward function from observation only, we use it to guide a robot to learn a door opening task, using only the images to evaluate the reward function. With the help of an initial kinesthetic demonstration that succeeds about 10% of the time, the robot learns to improve to 100% accuracy using the learned reward function.
Learning progression.
Emulating human movements with self-supervision and imitation.
In Time-Contrastive Networks: Self-Supervised Learning from Multi-View Observation, we propose a novel approach to learn about the world from observation and demonstrate it through self-supervised pose imitation. Our approach relies primarily on co-occurrence in time and space for supervision: by training to distinguish frames from different times of a video, it learns to disentangle and organize reality into useful abstract representations. 

In a pose imitation task for example, different dimensions of the representation may encode for different joints of a human or robotic body. Rather than defining by hand a mapping between human and robot joints (which is ambiguous in the first place because of physiological differences), we let the robot learn to imitate in an end-to-end fashion. When our model is simultaneously trained on human and robot observations, it naturally discovers the correspondence between the two, even though no correspondence is provided. We thus obtain a robot that can imitate human poses without having ever been given a correspondence between humans and robots.
Self-supervised human pose imitation by a robot.
A striking evidence of the benefits of learning end-to-end is the many-to-one and highly non-linearjoints mapping shown above. In this example, the up-down motion involves many joints for the human while only one joint is needed for the robot. We show that the robot has discovered this highly complex mapping on its own, without any explicit human pose information.

Grasping with semantic object categories
The experiments above illustrate how a person can specify a goal for a robot through an example demonstration, in which case the robots must interpret the semantics of the task -- salient events and relevant features of the pose. What if instead of showing the task, the human simply wants to tell it to what to do? This also requires the robot to understand semantics, in order to identify which objects in the world correspond to the semantic category specified by the user. In End-to-End Learning of Semantic Grasping, we study how a combination of manually labeled and autonomously collected data can be used to perform the task of semantic grasping, where the robot must pick up an object from a cluttered bin that matches a user-specified class label, such as “eraser” or “toy.” 
In our semantic grasping setup, the robotic arm is tasked with picking up an object corresponding to a user-provided semantic category (e.g. Legos).
To learn how to perform semantic grasping, our robots first gather a large dataset of grasping data by autonomously attempting to pick up a large variety of objects, as detailed in our previous postand prior work. This data by itself can allow a robot to pick up objects, but doesn’t allow it to understand how to associate them with semantic labels. To enable an understanding of semantics, we again enlist a modest amount of human supervision. Each time a robot successfully grasps an object, it presents it to the camera in a canonical pose, as illustrated below.
The robot presents objects to the camera after grasping. These images can be used to label which object category was picked up.
A subset of these images is then labeled by human labelers. Since the presentation images show the object in a canonical pose, it is easy to then propagate these labels to the remaining presentation images by training a classifier on the labeled examples. The labeled presentation images then tell the robot which object was actually picked up, and it can associate this label, in hindsight, with the images that it observed while picking up that object from the bin.

Using this labeled dataset, we can then train a two-stream model that predicts which object will be grasped, conditioned on the current image and the actions that the robot might take. The two-stream model that we employ is inspired by the dorsal-ventral decomposition observed in the human visual cortex, where the ventral stream reasons about the semantic class of objects, while the dorsal stream reasons about the geometry of the grasp. Crucially, the ventral stream can incorporate auxiliary data consisting of labeled images of objects (not necessarily from the robot), while the dorsal stream can incorporate auxiliary data of grasping that does not have semantic labels, allowing the entire system to be trained more effectively using larger amounts of heterogeneously labeled data. In this way, we can combine a limited amount of human labels with a large amount of autonomously collected robotic data to grasp objects based on desired semantic category, as illustrated in the video below:
<iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/WR5WUKXUQ8U/0.jpg" frameborder="0" height="360" src="https://www.youtube.com/embed/WR5WUKXUQ8U?rel=0&feature=player_embedded" width="640" style="max-width: 100%; margin: 15px 0px 20px;"></iframe>
Future Work
Our experiments show how limited semantically labeled data can be combined with data that is collected and labeled automatically by the robots, in order to enable robots to understand events, object categories, and user demonstrations. In the future, we might imagine that robotic systems could be trained with a combination of user-annotated data and ever-increasing autonomously collected datasets, improving robotic capability and easing the engineering burden of designing autonomous robots. Furthermore, as robotic systems collect more and more automatically annotated data in the real world, this data can be used to improve not just robotic systems, but also systems for computer vision, speech recognition, and natural language processing that can all benefit from such large auxiliary data sources.

Of course, we are not the first to consider the intersection of robotics and semantics. Extensive prior work in natural language understandingrobotic perceptiongrasping, and imitation learninghas considered how semantics and action can be combined in a robotic system. However, the experiments we discussed above might point the way to future work into combining self-supervised and human-labeled data in the context of autonomous robotic systems.

Acknowledgements
The research described in this post was performed by Pierre Sermanet, Kelvin Xu, Corey Lynch, Jasmine Hsu, Eric Jang, Sudheendra Vijayanarasimhan, Peter Pastor, Julian Ibarz, and Sergey Levine. We also thank Mrinal Kalakrishnan, Ali Yahya, and Yevgen Chebotar for developing the policy learning framework used for the door task, and John-Michael Burke for conducting experiments for semantic grasping.

Unsupervised Perceptual Rewards for Imitation Learning was presented at RSS 2017 by Kelvin Xu, and Time-Contrastive Networks: Self-Supervised Learning from Multi-View Observation will be presented this week at the CVPR Workshop on Deep Learning for Robotic Vision.

profile
번호
제목
글쓴이
45 Neural Discrete Representation Learning
권오성
2017-11-10 62
44 Word2vec 이해하기 image
kbkim
2017-11-06 75
43 HUAWEI's Kirin 970 chipset for A.I. image
jsh6293
2017-09-13 413
42 Open source speech recognition toolkit Kaldi now offers TensorFlow integration
최소연
2017-09-12 515
41 NASA is using Intel’s deep learning to build better moon maps image
양원
2017-08-22 518
40 Imagination - augmented agent
변경근
2017-08-08 616
39 3D immersive audio headphone
문현기
2017-08-08 491
38 구글 딥마인드의 AI 관련 연구 동향
qwinyjh
2017-08-04 472
로봇에게 의미론적 개념 이해를 가르치기: google reasearch image
강현주
2017-07-31 778
36 한글 인코딩의 이해 1편 : 한글 인코딩의 역사와 유니코드
권오성
2017-07-31 433
35 The New & Fast AI Tech Can Mimic Any Voice
hmj234
2017-07-24 360
34 Microsoft's New Smartphone App: Seeing AI image
최소연
2017-07-20 1029
33 Tensor2Tensor Library for Effective Deep-learning based Research
jsh6293
2017-07-17 354
32 [시장전망] AI 통번역의 진화 imagefile
jylee
2017-07-12 486
31 NVIDIA TensorRT: inference engine of deep learning structure image
양해민
2017-07-12 536
30 2017 International Conference on Learning Representations
양원
2017-07-09 480
29 인류의 발전, 인공지능 시대를 맞이할 우리 image
kbkim
2017-06-30 733
28 애플, 모바일 머신러닝 기술 공개..."구글보다 6배 빨라" image
qwinyjh
2017-06-30 443
27 협상할 수 있는 챗봇, 페이스북 ai 연구소에서 개발 및 공개 배포 image
강현주
2017-06-27 910
26 [비즈 인터뷰] "입체 음향으로 청각을 지배하라"...'VR계의 돌비' 꿈꾸는 오현오 가우디오 대표 image
문현기
2017-06-26 835