We consider the fully automated recognition of actions in an uncontrolled environment. Most existing work relies on domain knowledge to construct complex handcrafted features from inputs. In addition, the environments are usually assumed to be controlled. Convolutional neural networks (CNNs) are a type of deep model that can act directly on the raw inputs, thus automating the process of feature construction. However, such models are currently limited to handling 2D inputs. In this paper, we develop a novel 3D CNN model for action recognition. This model extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames. The developed model generates multiple channels of information from the input frames, and the final feature representation is obtained by combining information from all channels. We apply the developed model to recognize human actions in real-world environments, and it achieves superior performance without relying on handcrafted features.

1. Introduction

Recognizing human actions in real-world environments finds applications in a variety of domains, including intelligent video surveillance, customer attributes, and shopping behavior analysis. However, accurate recognition of actions is a highly challenging task due to cluttered backgrounds, occlusions, and viewpoint variations, among other factors. Therefore, most of the existing approaches (Efros et al., 2003; Schüldt et al., 2004; Dollár et al., 2005; Laptev & Pérez, 2007; Jhuang et al., 2007) make certain assumptions (e.g., small scale and viewpoint changes) about the circumstances under which the video was taken. However, such assumptions seldom hold in real-world environments. In addition, most of these approaches follow the conventional paradigm of pattern recognition, which consists of two steps: the first step computes complex handcrafted features from raw video frames, and the second step learns classifiers based on the obtained features. In real-world scenarios, it is rarely known which features are important for the task at hand, since the choice of features is highly problem-dependent. Especially for human action recognition, different action classes may appear dramatically different in terms of their appearances and motion patterns.

Deep learning models (Fukushima, 1980; LeCun et al., 1998; Hinton & Salakhutdinov, 2006; Hinton et al., 2006; Bengio, 2009) are a class of machines that can learn a hierarchy of features by building high-level features from low-level ones, thereby automating the process of feature construction. Such learning machines can be trained using either supervised or unsupervised approaches, and the resulting systems have been shown to yield competitive performance in visual object recognition (LeCun et al., 1998; Hinton et al., 2006; Ranzato et al., 2007; Lee et al., 2009a), natural language processing (Collobert & Weston, 2008), and audio classification (Lee et al., 2009b) tasks. Convolutional neural networks (CNNs) (LeCun et al., 1998) are a type of deep model in which trainable filters and local neighborhood pooling operations are applied alternately to the raw input images, resulting in a hierarchy of increasingly complex features. It has been shown that, when trained with appropriate regularization (Ahmed et al., 2008; Yu et al., 2008; Mobahi et al., 2009), CNNs can achieve superior performance on visual object recognition tasks without relying on handcrafted features. In addition, CNNs have been shown to be relatively insensitive to certain variations of the inputs (LeCun et al., 2004).
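For illustration only, the alternating structure of trainable filtering and local neighborhood pooling described above can be sketched in NumPy as follows; the kernel count, kernel size, pooling size, and the use of max pooling are illustrative assumptions rather than any configuration from this paper.

    # One CNN stage: trainable filter convolution followed by local
    # neighborhood (max) pooling. All sizes here are illustrative.
    import numpy as np

    def conv2d_valid(image, kernel):
        """'Valid' 2D correlation of a single-channel image with one kernel."""
        kh, kw = kernel.shape
        oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
        out = np.empty((oh, ow))
        for i in range(oh):
            for j in range(ow):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    def pool2d(feature_map, size=2):
        """Non-overlapping max pooling over size x size neighborhoods."""
        h, w = feature_map.shape
        h, w = h - h % size, w - w % size
        blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
        return blocks.max(axis=(1, 3))

    rng = np.random.default_rng(0)
    image = rng.standard_normal((28, 28))
    kernels = rng.standard_normal((4, 5, 5))     # 4 trainable 5x5 filters
    feature_maps = [np.tanh(conv2d_valid(image, k)) for k in kernels]
    pooled = [pool2d(fm) for fm in feature_maps]  # one pooled map per filter
    print(pooled[0].shape)                        # (12, 12)

Stacking several such stages yields the hierarchy of increasingly complex features mentioned above.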
As a class of deep models attractive for automated feature construction, CNNs have been primarily applied to 2D images. In this paper, we consider the use of CNNs for human action recognition in videos. A simple approach in this direction is to treat video frames as still images and apply CNNs to recognize actions at the individual frame level. Indeed, this approach has been used to analyze videos of developing embryos (Ning et al., 2005). However, such an approach does not consider the motion information encoded in multiple contiguous frames. To effectively incorporate the motion information in video analysis, we propose to perform 3D convolution in the convolutional layers of CNNs so that discriminative features along both the spatial and the temporal dimensions are captured. We show that by applying multiple distinct convolutional operations at the same location on the input, multiple types of features can be extracted. Based on the proposed 3D convolution, a variety of 3D CNN architectures can be devised to analyze video data. We develop a 3D CNN architecture that generates multiple channels of information from adjacent video frames and performs convolution and subsampling separately in each channel. The final feature representation is obtained by combining information from all channels. An additional advantage of CNN-based models is that the recognition phase is very efficient due to their feed-forward nature.
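As an illustration of the 3D convolution operation just described, the following NumPy sketch convolves a single kernel across the two spatial dimensions and the temporal dimension of a stack of adjacent frames; the frame count, frame size, and kernel size are illustrative assumptions, not the configuration used in the architecture described later in the paper.

    # A 3D kernel is convolved over (frames, height, width), so the resulting
    # feature map responds to motion as well as appearance.
    import numpy as np

    def conv3d_valid(volume, kernel):
        """'Valid' 3D correlation of a (frames, height, width) volume with a kernel."""
        kt, kh, kw = kernel.shape
        ot = volume.shape[0] - kt + 1
        oh = volume.shape[1] - kh + 1
        ow = volume.shape[2] - kw + 1
        out = np.empty((ot, oh, ow))
        for t in range(ot):
            for i in range(oh):
                for j in range(ow):
                    out[t, i, j] = np.sum(
                        volume[t:t + kt, i:i + kh, j:j + kw] * kernel)
        return out

    rng = np.random.default_rng(0)
    frames = rng.standard_normal((7, 60, 40))  # 7 adjacent frames, 60x40 pixels
    kernel = rng.standard_normal((3, 7, 7))    # spans 3 frames and 7x7 pixels
    feature_map = conv3d_valid(frames, kernel)
    print(feature_map.shape)                   # (5, 54, 34)

    # Applying several distinct kernels at the same locations yields multiple
    # types of spatio-temporal features from the same input cube.
    feature_maps = [conv3d_valid(frames, rng.standard_normal((3, 7, 7)))
                    for _ in range(3)]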
We evaluated the developed 3D CNN model on the TREC Video Retrieval Evaluation (TRECVID) data, which consist of surveillance video recorded at London Gatwick Airport. We constructed a multi-module event detection system, which includes the 3D CNN as a module, and participated in three tasks of the TRECVID 2009 Evaluation for Surveillance Event Detection. Our system achieved the best performance on all three tasks in which we participated. To provide an independent evaluation of the 3D CNN model, we report its performance on the TRECVID 2008 development set in this paper. We also present results on the KTH data, as published performance figures for this data set are available. Our experiments show that the developed 3D CNN model outperforms other baseline methods on the TRECVID data, and it achieves competitive performance on the KTH data, without depending on handcrafted features.

A one-against-all linear SVM is learned for each action class. Specifically, we extract dense SIFT descriptors (Lowe, 2004) from raw gray images or motion edge history images (MEHI) (Yang et al., 2009). Local features on raw gray images preserve the appearance information, while MEHI concerns the shape and motion patterns. These SIFT descriptors are calculated every 6 pixels from 7 × 7 and 16 × 16 local image patches in the same cubes as in the 3D CNN model. They are then softly quantized using a 512-word codebook to build the BoW features. To exploit the spatial layout information, we employ an approach similar to spatial pyramid matching (SPM) (Lazebnik et al., 2006) to partition the candidate region into 2 × 2 and 3 × 4 cells and concatenate their BoW features. The dimensionality of the entire feature vector is 512 × (2 × 2 + 3 × 4) = 8192. We denote the method based on gray images as SPM^cube_gray and the one based on MEHI as SPM^cube_MEHI.
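The following sketch illustrates how the per-cell bag-of-words histograms of this baseline are concatenated to reach 512 × (2 × 2 + 3 × 4) = 8192 dimensions; the soft-quantization weighting used here (inverse-distance assignment over the codebook) is an illustrative assumption, since the text only specifies the codebook size and the cell layout.

    # Spatial-pyramid bag-of-words: one soft histogram per cell, concatenated.
    import numpy as np

    CODEBOOK_SIZE = 512

    def soft_bow(descriptors, codebook):
        """Soft-assignment histogram of descriptors over the codebook
        (inverse-distance weighting is an illustrative assumption)."""
        hist = np.zeros(len(codebook))
        for d in descriptors:
            dist = np.linalg.norm(codebook - d, axis=1)
            w = 1.0 / (dist + 1e-8)
            hist += w / w.sum()          # each descriptor contributes weight 1
        return hist

    def spm_feature(cell_descriptors, codebook):
        """Concatenate per-cell BoW histograms for the 2x2 and 3x4 partitions."""
        return np.concatenate([soft_bow(d, codebook) for d in cell_descriptors])

    rng = np.random.default_rng(0)
    codebook = rng.standard_normal((CODEBOOK_SIZE, 128))   # 128-D SIFT words
    cells = [rng.standard_normal((20, 128)) for _ in range(2 * 2 + 3 * 4)]
    feature = spm_feature(cells, codebook)
    print(feature.shape)    # (8192,) = 512 words x 16 cells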
We report the 5-fold cross-validation results in which the data for a single day are used as a fold. The performance measures we use are precision, recall, and area under the ROC curve (AUC) at multiple values of the false positive rate (FPR). The performance of the four methods is summarized in Table 2. We can observe from Table 2 that the 3D CNN model outperforms the frame-based 2D CNN model, SPM^cube_gray, and SPM^cube_MEHI significantly on the action classes CellToEar and ObjectPut in all cases. For the action class Pointing, the 3D CNN model achieves slightly worse performance than the other three methods. From Table 1 we can see that the number of positive samples in the Pointing class is significantly larger than those of the other two classes. Hence, we can conclude that the 3D CNN model is more effective when the number of positive samples is small. Overall, the 3D CNN model outperforms the other three methods consistently, as can be seen from the average performance in Table 2.

4.2. Action Recognition on KTH Data

We evaluate the 3D CNN model on the KTH data (Schüldt et al., 2004), which consist of 6 action classes performed by 25 subjects. To follow the setup in the HMAX model, we use a 9-frame cube as input and extract the foreground as in (Jhuang et al., 2007). To reduce the memory requirement, the resolutions of the input frames are reduced to 80 × 60 in our experiments, as compared to the 160 × 120 used in (Jhuang et al., 2007). We use a 3D CNN architecture similar to that in Figure 3, with the sizes of the kernels and the number of feature maps in each layer modified to accommodate the 80 × 60 × 9 inputs. In particular, the three convolutional layers use kernels of sizes 9 × 7, 7 × 7, and 6 × 4, respectively, and the two subsampling layers use kernels of size 3 × 3. With this setting, the 80 × 60 × 9 inputs are converted into 128D feature vectors. The final layer consists of 6 units corresponding to the 6 classes.

As in (Jhuang et al., 2007), we use the data for 16 randomly selected subjects for training and the data for the other 9 subjects for testing. The recognition performance averaged across 5 random trials is reported in Table 3, along with published results from the literature. The 3D CNN model achieves an overall accuracy of 90.2%, as compared with the 91.7% achieved by the HMAX model. Note that the HMAX model uses handcrafted features computed from raw images at 4-fold higher resolution.

5. Conclusions and Discussions

We developed a 3D CNN model for action recognition in this paper. This model constructs features from both the spatial and the temporal dimensions by performing 3D convolutions. The developed deep architecture generates multiple channels of information from adjacent input frames and performs convolution and subsampling separately in each channel. The final feature representation is computed by combining information from all channels. We evaluated the 3D CNN model using the TRECVID and KTH data sets. The results show that the 3D CNN model outperforms the compared methods on the TRECVID data, while it achieves competitive performance on the KTH data, demonstrating its superior performance in real-world environments.