3D Convolutional Neural Networks for Human Action Recognition

We consider the fully automated recognition of actions in uncontrolled environments. Most existing work relies on domain knowledge to construct complex handcrafted features from inputs. In addition, the environments are usually assumed to be controlled. Convolutional neural networks (CNNs) are a type of deep model that can act directly on the raw inputs, thus automating the process of feature construction. However, such models are currently limited to handling 2D inputs. In this paper, we develop a novel 3D CNN model for action recognition. This model extracts features from both spatial and temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames. The developed model generates multiple channels of information from the input frames, and the final feature representation is obtained by combining information from all channels. We apply the developed model to recognize human actions in real-world environments, and it achieves superior performance without relying on handcrafted features.

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

1. Introduction

Recognizing human actions in real-world environments finds applications in a variety of domains, including intelligent video surveillance, customer attributes, and shopping behavior analysis. However, accurate recognition of actions is a highly challenging task due to cluttered backgrounds, occlusions, viewpoint variations, etc. Therefore, most of the existing approaches (Efros et al., 2003; Schüldt et al., 2004; Dollár et al., 2005; Laptev & Pérez, 2007; Jhuang et al., 2007) make certain assumptions (e.g., small scale and viewpoint changes) about the circumstances under which the video was taken. However, such assumptions seldom hold in real-world environments. In addition, most of these approaches follow the conventional paradigm of pattern recognition, which consists of two steps: the first computes complex handcrafted features from raw video frames, and the second learns classifiers based on the obtained features. In real-world scenarios, it is rarely known which features are important for the task at hand, since the choice of feature is highly problem-dependent. Especially for human action recognition, different action classes may appear dramatically different in terms of their appearances and motion patterns.

Deep learning models (Fukushima, 1980; LeCun et al., 1998; Hinton & Salakhutdinov, 2006; Hinton et al., 2006; Bengio, 2009) are a class of machines that can learn a hierarchy of features by building high-level features from low-level ones, thereby automating the process of feature construction. Such learning machines can be trained using either supervised or unsupervised approaches, and the resulting systems have been shown to yield competitive performance in visual object recognition (LeCun et al., 1998; Hinton et al., 2006; Ranzato et al., 2007; Lee et al., 2009a), natural language processing (Collobert & Weston, 2008), and audio classification (Lee et al., 2009b) tasks.

Convolutional neural networks (CNNs) (LeCun et al., 1998) are a type of deep model in which trainable filters and local neighborhood pooling operations are applied alternately on the raw input images, resulting in a hierarchy of increasingly complex features. It has been shown that, when trained with appropriate regularization (Ahmed et al., 2008; Yu et al., 2008; Mobahi et al., 2009), CNNs can achieve superior performance on visual object recognition tasks without relying on handcrafted features. In addition, CNNs have been shown to be relatively insensitive to certain variations on the inputs (LeCun et al., 2004).

As a class of attractive deep models for automated feature construction, CNNs have been primarily applied on 2D images. In this paper, we consider the use of CNNs for human action recognition in videos. A simple approach in this direction is to treat video frames as still images and apply CNNs to recognize actions at the individual frame level. Indeed, this approach has been used to analyze the videos of developing embryos (Ning et al., 2005). However, such an approach does not consider the motion information encoded in multiple contiguous frames. To effectively incorporate the motion information in video analysis, we propose to perform 3D convolution in the convolutional layers of CNNs so that discriminative features along both spatial and temporal dimensions are captured. We show that by applying multiple distinct convolutional operations at the same location on the input, multiple types of features can be extracted. Based on the proposed 3D convolution, a variety of 3D CNN architectures can be devised to analyze video data. We develop a 3D CNN architecture that generates multiple channels of information from adjacent video frames and performs convolution and subsampling separately in each channel. The final feature representation is obtained by combining information from all channels. An additional advantage of the CNN-based models is that the recognition phase is very efficient due to their feed-forward nature.

We evaluated the developed 3D CNN model on the TREC Video Retrieval Evaluation (TRECVID) data, which consist of surveillance video data recorded in London Gatwick Airport. We constructed a multi-module event detection system, which includes 3D CNN as a module, and participated in three tasks of the TRECVID 2009 Evaluation for Surveillance Event Detection. Our system achieved the best performance on all three participated tasks. To provide independent evaluation of the 3D CNN model, we report its performance on the TRECVID 2008 development set in this paper. We also present results on the KTH data, as published performance for this data is available. Our experiments show that the developed 3D CNN model outperforms other baseline methods on the TRECVID data, and it achieves competitive performance on the KTH data without depending on [...]

[...] against-all linear SVM is learned for each action class. Specifically, we extract dense SIFT descriptors (Lowe, 2004) from raw gray images or motion edge history images (MEHI) (Yang et al., 2009). Local features on raw gray images preserve the appearance information, while MEHI concerns the shape and motion patterns. These SIFT descriptors are calculated every 6 pixels from 7 × 7 and 16 × 16 local image patches in the same cubes as in the 3D CNN model. Then they are softly quantized using a 512-word codebook to build the BoW features. To exploit the spatial layout information, we employ an approach similar to spatial pyramid matching (SPM) (Lazebnik et al., 2006) to partition the candidate region into 2 × 2 and 3 × 4 cells and concatenate their BoW features. The dimensionality of the entire feature vector is 512 × (2 × 2 + 3 × 4) = 8192. We denote the method based on gray images as $\mathrm{SPM}^{\mathrm{cube}}_{\mathrm{gray}}$ and the one based on MEHI as $\mathrm{SPM}^{\mathrm{cube}}_{\mathrm{MEHI}}$.

We report the 5-fold cross-validation results, in which the data for a single day are used as a fold. The performance measures we used are precision, recall, and area under the ROC curve (AUC) at multiple values of false positive rates (FPR). The performance of the four methods is summarized in Table 2. We can observe from Table 2 that the 3D CNN model outperforms the frame-based 2D CNN model, $\mathrm{SPM}^{\mathrm{cube}}_{\mathrm{gray}}$, and $\mathrm{SPM}^{\mathrm{cube}}_{\mathrm{MEHI}}$ significantly on the action classes CellToEar and ObjectPut in all cases. For the action class Pointing, the 3D CNN model achieves slightly worse performance than the other three methods. From Table 1 we can see that the number of positive samples in the Pointing class is significantly larger than those of the other two classes. Hence, we can conclude that the 3D CNN model is more effective when the number of positive samples is small. Overall, the 3D CNN model outperforms the other three methods consistently, as can be seen from the average performance in Table 2.

4.2. Action Recognition on KTH Data

We evaluate the 3D CNN model on the KTH data (Schüldt et al., 2004), which consist of 6 action classes performed by 25 subjects. To follow the setup in the HMAX model, we use a 9-frame cube as input and extract foreground as in (Jhuang et al., 2007). To reduce the memory requirement, the resolutions of the input frames are reduced to 80 × 60 in our experiments, as compared to 160 × 120 used in (Jhuang et al., 2007). We use a 3D CNN architecture similar to that in Figure 3, with the sizes of kernels and the number of feature maps in each layer modified to consider the 80 × 60 × 9 inputs. In particular, the three convolutional layers use kernels of sizes 9 × 7, 7 × 7, and 6 × 4, respectively, and the two subsampling layers use kernels of size 3 × 3. By using this setting, the 80 × 60 × 9 inputs are converted into 128D feature vectors. The final layer consists of 6 units corresponding to the 6 classes.

As in (Jhuang et al., 2007), we use the data for 16 randomly selected subjects for training and the data for the other 9 subjects for testing. The recognition performance averaged across 5 random trials is reported in Table 3 along with published results in the literature. The 3D CNN model achieves an overall accuracy of 90.2%, as compared with 91.7% achieved by the HMAX model. Note that the HMAX model uses handcrafted features computed from raw images with 4-fold higher resolution.

5. Conclusions and Discussions

We developed a 3D CNN model for action recognition in this paper. This model constructs features from both spatial and temporal dimensions by performing 3D convolutions. The developed deep architecture generates multiple channels of information from adjacent input frames and performs convolution and subsampling separately in each channel. The final feature representation is computed by combining information from all channels. We evaluated the 3D CNN model using the TRECVID and the KTH data sets. Results show that the 3D CNN model outperforms compared methods on the TRECVID data, while it achieves competitive performance on the KTH data, demonstrating its superior performance in real-world environments.
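The 3D convolution summarized above can be illustrated with a minimal sketch. It assumes a single input channel, no padding, stride 1, and no bias or nonlinearity, and it is written in plain Python for readability rather than speed. The function name `conv3d_valid` and the toy frames are hypothetical and are not taken from the paper.

```python
def conv3d_valid(volume, kernel):
    """Slide a 3D kernel over a volume indexed [frame][row][col], no padding.

    Illustrative only: single channel, stride 1, no bias, no nonlinearity.
    """
    n_frames, n_rows, n_cols = len(volume), len(volume[0]), len(volume[0][0])
    k_frames, k_rows, k_cols = len(kernel), len(kernel[0]), len(kernel[0][0])
    output = []
    for t in range(n_frames - k_frames + 1):
        plane = []
        for y in range(n_rows - k_rows + 1):
            row = []
            for x in range(n_cols - k_cols + 1):
                acc = 0.0
                # Sum over the kernel's temporal and spatial extent, so one
                # output value mixes information from several adjacent frames.
                for dt in range(k_frames):
                    for dy in range(k_rows):
                        for dx in range(k_cols):
                            acc += kernel[dt][dy][dx] * volume[t + dt][y + dy][x + dx]
                row.append(acc)
            plane.append(row)
        output.append(plane)
    return output


if __name__ == "__main__":
    # Three toy 2x2 frames whose brightness changes over time.
    frames = [
        [[0, 0], [0, 0]],
        [[1, 1], [1, 1]],
        [[3, 3], [3, 3]],
    ]
    # A kernel spanning 2 frames and 1x1 pixels: a temporal difference.
    temporal_diff = [[[-1]], [[1]]]
    print(conv3d_valid(frames, temporal_diff))
```

In this toy case the kernel responds only to change between adjacent frames, which is the kind of motion cue a 2D kernel applied to one frame at a time cannot see.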
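As a sanity check on the KTH configuration in Section 4.2, the spatial sizes can be traced layer by layer. The sketch below assumes unpadded, stride-1 convolutions, non-overlapping 3 × 3 subsampling, an alternating convolution/subsampling ordering, and kernel sizes given as width × height; none of these details are stated in the section itself. Temporal sizes are left out because the section does not give the temporal extent of the kernels. The helper names are hypothetical.

```python
def conv_size(size, kernel):
    """Output length of an unpadded, stride-1 convolution (assumed)."""
    return size - kernel + 1


def subsample_size(size, factor):
    """Output length of non-overlapping subsampling (assumed)."""
    return size // factor


def trace_spatial_sizes(width, height, layers):
    """Return (width, height) for the input and after each layer."""
    sizes = [(width, height)]
    for kind, k_width, k_height in layers:
        step = conv_size if kind == "conv" else subsample_size
        width, height = step(width, k_width), step(height, k_height)
        sizes.append((width, height))
    return sizes


# Kernel sizes are those quoted in Section 4.2; the ordering is an assumption.
KTH_LAYERS = [
    ("conv", 9, 7),
    ("sub", 3, 3),
    ("conv", 7, 7),
    ("sub", 3, 3),
    ("conv", 6, 4),
]

if __name__ == "__main__":
    for w, h in trace_spatial_sizes(80, 60, KTH_LAYERS):
        print(f"{w} x {h}")
```

Under these assumptions the 80 × 60 frame shrinks to a single spatial location after the third convolution, which would be consistent with a 128D feature vector if the last convolutional layer has 128 feature maps (also an assumption).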