Until now, teaching a robot a new task has required a separate cost function for each task, and designing these functions has been a complicated process. People, on the other hand, can easily learn how to handle an object just by watching someone else perform the task. This post covers work on bringing that ability to robots.

Machine learning can allow robots to acquire complex skills, such as grasping and opening doors. However, learning these skills requires us to manually program reward functions that the robots then attempt to optimize. In contrast, people can understand the goal of a task just from watching someone else do it, or simply by being told what the goal is. We can do this because we draw on our own prior knowledge about the world: when we see someone cut an apple, we understand that the goal is to produce two slices, regardless of what type of apple it is, or what kind of tool is used to cut it. Similarly, if we are told to pick up the apple, we understand which object we are to grab because we can ground the word “apple” in the environment: we know what it means. 

These are semantic concepts: salient events like producing two slices, and object categories denoted by words such as “apple.” Can we teach robots to understand semantic concepts, so that they can follow simple commands specified through categorical labels or user-provided examples? In this post, we discuss some of our recent work on robotic learning that combines experience that is autonomously gathered by the robot, which is plentiful but lacks human-provided labels, with human-labeled data that allows a robot to understand semantics. We will describe how robots can use their experience to understand the salient events in a human-provided demonstration, mimic human movements despite the differences between human and robot bodies, and understand semantic categories, like “toy” and “pen”, to pick up objects based on user commands.

Understanding human demonstrations with deep visual features
In the first set of experiments, which appear in our paper Unsupervised Perceptual Rewards for Imitation Learning, our aim is to enable a robot to understand a task, such as opening a door, from seeing only a small number of unlabeled human demonstrations. By analyzing these demonstrations, the robot must identify the semantically salient event that constitutes task success, and then use reinforcement learning to perform it.
Examples of human demonstrations (left) and the corresponding robotic imitation (right).
Unsupervised learning on very small datasets is one of the most challenging scenarios in machine learning. To make this feasible, we use deep visual features from a large network trained for image recognition on ImageNet. Such features are known to be sensitive to semantic concepts, while maintaining invariance to nuisance variables such as appearance and lighting. We use these features to interpret user-provided demonstrations, and show that it is indeed possible to learn reward functions in an unsupervised fashion from a few demonstrations and without retraining.
Example of reward functions learned solely from observation for the door opening tasks. Rewards progressively increase from zero to the maximum reward as a task is completed.
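As a rough illustration of this idea (not the exact pipeline from the paper, which selects discriminative pretrained features and fits per-stage reward classifiers), the sketch below scores new frames by how closely their frozen ImageNet features match the final frames of a few demonstrations. The backbone choice, the last_k window, and the cosine-similarity reward are all assumptions made for the example.

```python
# Minimal sketch, assuming PyTorch/torchvision and a frozen ImageNet backbone.
# Frames that look like the "task completed" frames of a few demonstrations
# receive a higher perceptual reward.
import torch
import torchvision.models as models
import torchvision.transforms as T

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = torch.nn.Identity()   # keep the 2048-d pooled features
backbone.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def features(frames):
    """frames: list of PIL images -> (N, 2048) pretrained features."""
    return backbone(torch.stack([preprocess(f) for f in frames]))

def goal_prototype(demo_videos, last_k=5):
    """Average features of the last few frames of each demonstration."""
    per_demo = [features(video[-last_k:]).mean(dim=0) for video in demo_videos]
    return torch.stack(per_demo).mean(dim=0)

@torch.no_grad()
def perceptual_reward(frame, prototype):
    """Cosine similarity to the goal prototype, used as a reward in [-1, 1]."""
    f = features([frame])[0]
    return torch.nn.functional.cosine_similarity(f, prototype, dim=0).item()
```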
After learning a reward function from observation alone, we use it to guide a robot as it learns a door opening task, evaluating the reward function from camera images only. Starting from an initial kinesthetic demonstration that succeeds about 10% of the time, the robot improves to a 100% success rate using the learned reward function.
Learning progression.
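The learned perceptual reward can then drive ordinary episodic policy improvement. The loop below is a deliberately generic sketch, assuming a hypothetical rollout(theta) function that executes a parameterized policy on the robot and returns the camera frames it observed; the real system initializes its policy search from the kinesthetic demonstration and uses a different update rule than this simplified one.

```python
# Generic episodic policy-improvement sketch driven by the learned reward.
# rollout(theta) is a hypothetical function that executes the policy with
# parameters theta on the robot and returns the camera frames it observed.
import numpy as np

def improve_policy(theta0, rollout, perceptual_reward, prototype,
                   iterations=20, samples=10, sigma=0.05):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iterations):
        perturbations = [sigma * np.random.randn(*theta.shape) for _ in range(samples)]
        returns = []
        for eps in perturbations:
            frames = rollout(theta + eps)
            returns.append(sum(perceptual_reward(f, prototype) for f in frames))
        # Reward-weighted update: move toward higher-scoring perturbations.
        weights = np.exp(np.asarray(returns) - max(returns))
        weights /= weights.sum()
        theta = theta + sum(w * eps for w, eps in zip(weights, perturbations))
    return theta
```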
Emulating human movements with self-supervision and imitation
In Time-Contrastive Networks: Self-Supervised Learning from Multi-View Observation, we propose a novel approach to learn about the world from observation and demonstrate it through self-supervised pose imitation. Our approach relies primarily on co-occurrence in time and space for supervision: by training to distinguish frames from different times of a video, it learns to disentangle and organize reality into useful abstract representations. 
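Concretely, the multi-view time-contrastive objective can be expressed as a triplet loss: frames captured at the same moment by two different cameras should embed close together, while frames from the same camera at a distant time should embed far apart. The sketch below assumes an arbitrary convolutional embed network and an illustrative margin; the half-sequence shift used to pick negatives is also just an assumption for the example.

```python
# Multi-view time-contrastive (triplet) loss sketch, assuming PyTorch and an
# arbitrary convolutional `embed` network that maps images to D-dim vectors.
import torch
import torch.nn.functional as F

def time_contrastive_loss(embed, view1, view2, margin=0.2):
    """view1, view2: (T, C, H, W) synchronized frames from two cameras.
    Anchor: frame t from view1; positive: the same moment from view2;
    negative: a temporally distant frame from view1 (half-sequence shift)."""
    anchors = embed(view1)                      # (T, D)
    positives = embed(view2)                    # same moments, other camera
    negatives = anchors.roll(shifts=view1.shape[0] // 2, dims=0)
    d_pos = (anchors - positives).pow(2).sum(dim=1)
    d_neg = (anchors - negatives).pow(2).sum(dim=1)
    return F.relu(d_pos - d_neg + margin).mean()
```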

In a pose imitation task for example, different dimensions of the representation may encode for different joints of a human or robotic body. Rather than defining by hand a mapping between human and robot joints (which is ambiguous in the first place because of physiological differences), we let the robot learn to imitate in an end-to-end fashion. When our model is simultaneously trained on human and robot observations, it naturally discovers the correspondence between the two, even though no correspondence is provided. We thus obtain a robot that can imitate human poses without having ever been given a correspondence between humans and robots.
Self-supervised human pose imitation by a robot.
Striking evidence of the benefits of learning end-to-end is the many-to-one and highly non-linear joint mapping shown above. In this example, the up-down motion involves many joints for the human, while only one joint is needed for the robot. We show that the robot has discovered this highly complex mapping on its own, without any explicit human pose information.
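One way to sketch how such a mapping can emerge without human pose labels: the robot can regress its own joint angles from the shared embedding of images of itself (its own joints are always known to it), and because human and robot frames are embedded in the same space, the same decoder can then be applied to an embedded human frame. The tcn_embed network, the embedding size, and the seven-joint output below are placeholders, not the paper's architecture.

```python
# Pose-imitation sketch on top of a trained TCN embedding (tcn_embed is assumed).
# The decoder is supervised only by the robot's own joint readings; no human
# pose labels are used anywhere.
import torch
import torch.nn as nn

class JointsDecoder(nn.Module):
    """Maps the shared embedding to robot joint angles (illustrative sizes)."""
    def __init__(self, embedding_dim=32, num_joints=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embedding_dim, 128), nn.ReLU(),
            nn.Linear(128, num_joints),
        )

    def forward(self, embedding):
        return self.net(embedding)

def imitation_step(tcn_embed, decoder, robot_frames, robot_joints, human_frame):
    """Training signal from self-observation; imitation target from a human frame."""
    loss = torch.nn.functional.mse_loss(decoder(tcn_embed(robot_frames)), robot_joints)
    with torch.no_grad():
        target_joints = decoder(tcn_embed(human_frame))  # joints to command for imitation
    return loss, target_joints
```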

Grasping with semantic object categories
The experiments above illustrate how a person can specify a goal for a robot through an example demonstration, in which case the robot must interpret the semantics of the task: salient events and relevant features of the pose. What if, instead of showing the task, the human simply wants to tell the robot what to do? This also requires the robot to understand semantics, in order to identify which objects in the world correspond to the semantic category specified by the user. In End-to-End Learning of Semantic Grasping, we study how a combination of manually labeled and autonomously collected data can be used to perform the task of semantic grasping, where the robot must pick up an object from a cluttered bin that matches a user-specified class label, such as “eraser” or “toy.”
In our semantic grasping setup, the robotic arm is tasked with picking up an object corresponding to a user-provided semantic category (e.g. Legos).
To learn how to perform semantic grasping, our robots first gather a large dataset of grasping data by autonomously attempting to pick up a large variety of objects, as detailed in our previous post and prior work. This data by itself can allow a robot to pick up objects, but doesn’t allow it to understand how to associate them with semantic labels. To enable an understanding of semantics, we again enlist a modest amount of human supervision. Each time a robot successfully grasps an object, it presents it to the camera in a canonical pose, as illustrated below.
The robot presents objects to the camera after grasping. These images can be used to label which object category was picked up.
A subset of these images is then labeled by human labelers. Since the presentation images show the object in a canonical pose, it is easy to then propagate these labels to the remaining presentation images by training a classifier on the labeled examples. The labeled presentation images then tell the robot which object was actually picked up, and it can associate this label, in hindsight, with the images that it observed while picking up that object from the bin.
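A rough sketch of this hindsight-labeling step: a simple classifier fit on the human-labeled presentation images predicts labels for the unlabeled ones, and each label is then attached to every image the robot observed during the corresponding grasp. The scikit-learn classifier and the episode fields below are illustrative assumptions, not the components of the actual system.

```python
# Hindsight-labeling sketch, assuming scikit-learn and episode records with
# `presentation_index` and `grasp_images` fields (both hypothetical names).
from sklearn.linear_model import LogisticRegression

def propagate_labels(labeled_feats, labels, unlabeled_feats, episodes):
    """labeled_feats / unlabeled_feats: features of presentation images."""
    clf = LogisticRegression(max_iter=1000).fit(labeled_feats, labels)
    predicted = clf.predict(unlabeled_feats)

    # Every image observed during a grasp inherits the label of the object the
    # robot ended up presenting to the camera after that grasp.
    dataset = []
    for ep in episodes:
        label = predicted[ep.presentation_index]
        dataset.extend((img, label) for img in ep.grasp_images)
    return dataset
```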

Using this labeled dataset, we can then train a two-stream model that predicts which object will be grasped, conditioned on the current image and the actions that the robot might take. The two-stream model that we employ is inspired by the dorsal-ventral decomposition observed in the human visual cortex, where the ventral stream reasons about the semantic class of objects, while the dorsal stream reasons about the geometry of the grasp. Crucially, the ventral stream can incorporate auxiliary data consisting of labeled images of objects (not necessarily from the robot), while the dorsal stream can incorporate auxiliary data of grasping that does not have semantic labels, allowing the entire system to be trained more effectively using larger amounts of heterogeneously labeled data. In this way, we can combine a limited amount of human labels with a large amount of autonomously collected robotic data to grasp objects based on desired semantic category, as illustrated in the video below:
[Video: https://www.youtube.com/watch?v=WR5WUKXUQ8U]
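For concreteness, here is a minimal sketch of a two-stream model in this spirit: a dorsal stream scores whether a candidate action would produce a successful grasp, a ventral stream predicts the class of the object that would be grasped, and the product of the two ranks actions for a commanded class. The layer sizes, the feature and action encodings, and the multiplicative fusion are assumptions for the example, not the paper's exact design.

```python
# Two-stream (dorsal/ventral) sketch for semantic grasping, assuming PyTorch.
# image_feat is a (1, feat_dim) visual feature; candidate_actions is (N, action_dim).
import torch
import torch.nn as nn

class TwoStreamGrasping(nn.Module):
    def __init__(self, feat_dim=512, action_dim=4, num_classes=16):
        super().__init__()
        self.dorsal = nn.Sequential(          # "how to grasp": success prediction
            nn.Linear(feat_dim + action_dim, 256), nn.ReLU(), nn.Linear(256, 1),
        )
        self.ventral = nn.Sequential(         # "what is grasped": class prediction
            nn.Linear(feat_dim + action_dim, 256), nn.ReLU(), nn.Linear(256, num_classes),
        )

    def forward(self, image_feat, action):
        x = torch.cat([image_feat, action], dim=-1)
        p_grasp = torch.sigmoid(self.dorsal(x))           # P(success | image, action)
        p_class = torch.softmax(self.ventral(x), dim=-1)  # P(class | image, action)
        return p_grasp, p_class

def rank_actions(model, image_feat, candidate_actions, target_class):
    """Rank candidate actions by P(success) * P(target class)."""
    feats = image_feat.expand(candidate_actions.shape[0], -1)
    p_grasp, p_class = model(feats, candidate_actions)
    return p_grasp.squeeze(-1) * p_class[:, target_class]
```

Because each stream consumes a different kind of label, this split is what lets unlabeled grasp outcomes and externally labeled object images both contribute to training.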
Future Work
Our experiments show how limited semantically labeled data can be combined with data that is collected and labeled automatically by the robots, in order to enable robots to understand events, object categories, and user demonstrations. In the future, we might imagine that robotic systems could be trained with a combination of user-annotated data and ever-increasing autonomously collected datasets, improving robotic capability and easing the engineering burden of designing autonomous robots. Furthermore, as robotic systems collect more and more automatically annotated data in the real world, this data can be used to improve not just robotic systems, but also systems for computer vision, speech recognition, and natural language processing that can all benefit from such large auxiliary data sources.

Of course, we are not the first to consider the intersection of robotics and semantics. Extensive prior work in natural language understanding, robotic perception, grasping, and imitation learning has considered how semantics and action can be combined in a robotic system. However, the experiments we discussed above might point the way to future work on combining self-supervised and human-labeled data in the context of autonomous robotic systems.

Acknowledgements
The research described in this post was performed by Pierre Sermanet, Kelvin Xu, Corey Lynch, Jasmine Hsu, Eric Jang, Sudheendra Vijayanarasimhan, Peter Pastor, Julian Ibarz, and Sergey Levine. We also thank Mrinal Kalakrishnan, Ali Yahya, and Yevgen Chebotar for developing the policy learning framework used for the door task, and John-Michael Burke for conducting experiments for semantic grasping.

Unsupervised Perceptual Rewards for Imitation Learning was presented at RSS 2017 by Kelvin Xu, and Time-Contrastive Networks: Self-Supervised Learning from Multi-View Observation will be presented this week at the CVPR Workshop on Deep Learning for Robotic Vision.
