We consider the fully automated recognition of actions in an uncontrolled environment. Most existing work relies on domain knowledge to construct complex handcrafted features from inputs. In addition, the environments are usually assumed to be controlled. Convolutional neural networks (CNNs) are a type of deep model that can act directly on the raw inputs, thus automating the process of feature construction. However, such models are currently limited to handling 2D inputs. In this paper, we develop a novel 3D CNN model for action recognition. This model extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames. The developed model generates multiple channels of information from the input frames, and the final feature representation is obtained by combining information from all channels. We apply the developed model to recognize human actions in real-world environments, and it achieves superior performance without relying on handcrafted features.

1. Introduction

Recognizing human actions in real-world environments finds applications in a variety of domains, including intelligent video surveillance, customer attributes, and shopping behavior analysis. However, accurate recognition of actions is a highly challenging task due to cluttered backgrounds, occlusions, and viewpoint variations, among other factors. Therefore, most of the existing approaches (Efros et al., 2003; Schüldt et al., 2004; Dollár et al., 2005; Laptev & Pérez, 2007; Jhuang et al., 2007) make certain assumptions (e.g., small scale and viewpoint changes) about the circumstances under which the video was taken. However, such assumptions seldom hold in real-world environments. In addition, most of these approaches follow the conventional paradigm of pattern recognition, which consists of two steps: the first step computes complex handcrafted features from raw video frames, and the second step learns classifiers based on the obtained features. In real-world scenarios, it is rarely known which features are important for the task at hand, since the choice of features is highly problem-dependent. Especially for human action recognition, different action classes may appear dramatically different in terms of their appearances and motion patterns.

Deep learning models (Fukushima, 1980; LeCun et al., 1998; Hinton & Salakhutdinov, 2006; Hinton et al., 2006; Bengio, 2009) are a class of machines that can learn a hierarchy of features by building high-level features from low-level ones, thereby automating the process of feature construction. Such learning machines can be trained using either supervised or unsupervised approaches, and the resulting systems have been shown to yield competitive performance in visual object recognition (LeCun et al., 1998; Hinton et al., 2006; Ranzato et al., 2007; Lee et al., 2009a), natural language processing (Collobert & Weston, 2008), and audio classification (Lee et al., 2009b) tasks. Convolutional neural networks (CNNs) (LeCun et al., 1998) are a type of deep model in which trainable filters and local neighborhood pooling operations are applied alternately to the raw input images, resulting in a hierarchy of increasingly complex features. It has been shown that, when trained with appropriate regularization (Ahmed et al., 2008; Yu et al., 2008; Mobahi et al., 2009), CNNs can achieve superior performance on visual object recognition tasks without relying on handcrafted features. In addition, CNNs have been shown to be relatively insensitive to certain variations of the inputs (LeCun et al., 2004).
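For illustration only, the alternating structure of trainable filtering and local neighborhood pooling described above can be sketched in NumPy as follows; the kernel count, kernel size, pooling size, and the use of max pooling are illustrative assumptions rather than any configuration from this paper.

    # One CNN stage: trainable filter convolution followed by local
    # neighborhood (max) pooling. All sizes here are illustrative.
    import numpy as np

    def conv2d_valid(image, kernel):
        """'Valid' 2D correlation of a single-channel image with one kernel."""
        kh, kw = kernel.shape
        oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
        out = np.empty((oh, ow))
        for i in range(oh):
            for j in range(ow):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    def pool2d(feature_map, size=2):
        """Non-overlapping max pooling over size x size neighborhoods."""
        h, w = feature_map.shape
        h, w = h - h % size, w - w % size
        blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
        return blocks.max(axis=(1, 3))

    rng = np.random.default_rng(0)
    image = rng.standard_normal((28, 28))
    kernels = rng.standard_normal((4, 5, 5))     # 4 trainable 5x5 filters
    feature_maps = [np.tanh(conv2d_valid(image, k)) for k in kernels]
    pooled = [pool2d(fm) for fm in feature_maps]  # one pooled map per filter
    print(pooled[0].shape)                        # (12, 12)

Stacking several such stages yields the hierarchy of increasingly complex features mentioned above.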
As a class of deep models attractive for automated feature construction, CNNs have been primarily applied to 2D images. In this paper, we consider the use of CNNs for human action recognition in videos. A simple approach in this direction is to treat video frames as still images and apply CNNs to recognize actions at the individual frame level. Indeed, this approach has been used to analyze videos of developing embryos (Ning et al., 2005). However, such an approach does not consider the motion information encoded in multiple contiguous frames. To effectively incorporate the motion information in video analysis, we propose to perform 3D convolution in the convolutional layers of CNNs so that discriminative features along both the spatial and the temporal dimensions are captured. We show that by applying multiple distinct convolutional operations at the same location on the input, multiple types of features can be extracted. Based on the proposed 3D convolution, a variety of 3D CNN architectures can be devised to analyze video data. We develop a 3D CNN architecture that generates multiple channels of information from adjacent video frames and performs convolution and subsampling separately in each channel. The final feature representation is obtained by combining information from all channels. An additional advantage of CNN-based models is that the recognition phase is very efficient due to their feed-forward nature.
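As an illustration of the 3D convolution operation just described, the following NumPy sketch convolves a single kernel across the two spatial dimensions and the temporal dimension of a stack of adjacent frames; the frame count, frame size, and kernel size are illustrative assumptions, not the configuration used in the architecture described later in the paper.

    # A 3D kernel is convolved over (frames, height, width), so the resulting
    # feature map responds to motion as well as appearance.
    import numpy as np

    def conv3d_valid(volume, kernel):
        """'Valid' 3D correlation of a (frames, height, width) volume with a kernel."""
        kt, kh, kw = kernel.shape
        ot = volume.shape[0] - kt + 1
        oh = volume.shape[1] - kh + 1
        ow = volume.shape[2] - kw + 1
        out = np.empty((ot, oh, ow))
        for t in range(ot):
            for i in range(oh):
                for j in range(ow):
                    out[t, i, j] = np.sum(
                        volume[t:t + kt, i:i + kh, j:j + kw] * kernel)
        return out

    rng = np.random.default_rng(0)
    frames = rng.standard_normal((7, 60, 40))  # 7 adjacent frames, 60x40 pixels
    kernel = rng.standard_normal((3, 7, 7))    # spans 3 frames and 7x7 pixels
    feature_map = conv3d_valid(frames, kernel)
    print(feature_map.shape)                   # (5, 54, 34)

    # Applying several distinct kernels at the same locations yields multiple
    # types of spatio-temporal features from the same input cube.
    feature_maps = [conv3d_valid(frames, rng.standard_normal((3, 7, 7)))
                    for _ in range(3)]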
We evaluated the developed 3D CNN model on the TREC Video Retrieval Evaluation (TRECVID) data, which consist of surveillance video recorded at London Gatwick Airport. We constructed a multi-module event detection system, which includes the 3D CNN as a module, and participated in three tasks of the TRECVID 2009 Evaluation for Surveillance Event Detection. Our system achieved the best performance on all three tasks in which we participated. To provide an independent evaluation of the 3D CNN model, we report its performance on the TRECVID 2008 development set in this paper. We also present results on the KTH data, as published performance figures for this data set are available. Our experiments show that the developed 3D CNN model outperforms other baseline methods on the TRECVID data, and it achieves competitive performance on the KTH data, without depending on handcrafted features.

A one-against-all linear SVM is learned for each action class. Specifically, we extract dense SIFT descriptors (Lowe, 2004) from raw gray images or motion edge history images (MEHI) (Yang et al., 2009). Local features on raw gray images preserve the appearance information, while MEHI concerns the shape and motion patterns. These SIFT descriptors are calculated every 6 pixels from 7 × 7 and 16 × 16 local image patches in the same cubes as in the 3D CNN model. They are then softly quantized using a 512-word codebook to build the BoW features. To exploit the spatial layout information, we employ an approach similar to spatial pyramid matching (SPM) (Lazebnik et al., 2006) to partition the candidate region into 2 × 2 and 3 × 4 cells and concatenate their BoW features. The dimensionality of the entire feature vector is 512 × (2 × 2 + 3 × 4) = 8192. We denote the method based on gray images as SPM^cube_gray and the one based on MEHI as SPM^cube_MEHI.
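The following sketch illustrates how the per-cell bag-of-words histograms of this baseline are concatenated to reach 512 × (2 × 2 + 3 × 4) = 8192 dimensions; the soft-quantization weighting used here (inverse-distance assignment over the codebook) is an illustrative assumption, since the text only specifies the codebook size and the cell layout.

    # Spatial-pyramid bag-of-words: one soft histogram per cell, concatenated.
    import numpy as np

    CODEBOOK_SIZE = 512

    def soft_bow(descriptors, codebook):
        """Soft-assignment histogram of descriptors over the codebook
        (inverse-distance weighting is an illustrative assumption)."""
        hist = np.zeros(len(codebook))
        for d in descriptors:
            dist = np.linalg.norm(codebook - d, axis=1)
            w = 1.0 / (dist + 1e-8)
            hist += w / w.sum()          # each descriptor contributes weight 1
        return hist

    def spm_feature(cell_descriptors, codebook):
        """Concatenate per-cell BoW histograms for the 2x2 and 3x4 partitions."""
        return np.concatenate([soft_bow(d, codebook) for d in cell_descriptors])

    rng = np.random.default_rng(0)
    codebook = rng.standard_normal((CODEBOOK_SIZE, 128))   # 128-D SIFT words
    cells = [rng.standard_normal((20, 128)) for _ in range(2 * 2 + 3 * 4)]
    feature = spm_feature(cells, codebook)
    print(feature.shape)    # (8192,) = 512 words x 16 cells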
We report the 5-fold cross-validation results in which the data for a single day are used as a fold. The performance measures we use are precision, recall, and area under the ROC curve (AUC) at multiple values of the false positive rate (FPR). The performance of the four methods is summarized in Table 2. We can observe from Table 2 that the 3D CNN model outperforms the frame-based 2D CNN model, SPM^cube_gray, and SPM^cube_MEHI significantly on the action classes CellToEar and ObjectPut in all cases. For the action class Pointing, the 3D CNN model achieves slightly worse performance than the other three methods. From Table 1 we can see that the number of positive samples in the Pointing class is significantly larger than those of the other two classes. Hence, we can conclude that the 3D CNN model is more effective when the number of positive samples is small. Overall, the 3D CNN model outperforms the other three methods consistently, as can be seen from the average performance in Table 2.

4.2. Action Recognition on KTH Data

We evaluate the 3D CNN model on the KTH data (Schüldt et al., 2004), which consist of 6 action classes performed by 25 subjects. To follow the setup in the HMAX model, we use a 9-frame cube as input and extract the foreground as in (Jhuang et al., 2007). To reduce the memory requirement, the resolutions of the input frames are reduced to 80 × 60 in our experiments, as compared to the 160 × 120 used in (Jhuang et al., 2007). We use a 3D CNN architecture similar to that in Figure 3, with the sizes of the kernels and the number of feature maps in each layer modified to accommodate the 80 × 60 × 9 inputs. In particular, the three convolutional layers use kernels of sizes 9 × 7, 7 × 7, and 6 × 4, respectively, and the two subsampling layers use kernels of size 3 × 3. With this setting, the 80 × 60 × 9 inputs are converted into 128D feature vectors. The final layer consists of 6 units corresponding to the 6 classes.

As in (Jhuang et al., 2007), we use the data for 16 randomly selected subjects for training and the data for the other 9 subjects for testing. The recognition performance averaged across 5 random trials is reported in Table 3, along with published results from the literature. The 3D CNN model achieves an overall accuracy of 90.2%, as compared with the 91.7% achieved by the HMAX model. Note that the HMAX model uses handcrafted features computed from raw images at 4-fold higher resolution.

5. Conclusions and Discussions

We developed a 3D CNN model for action recognition in this paper. This model constructs features from both the spatial and the temporal dimensions by performing 3D convolutions. The developed deep architecture generates multiple channels of information from adjacent input frames and performs convolution and subsampling separately in each channel. The final feature representation is computed by combining information from all channels. We evaluated the 3D CNN model using the TRECVID and KTH data sets. The results show that the 3D CNN model outperforms the compared methods on the TRECVID data, while it achieves competitive performance on the KTH data, demonstrating its superior performance in real-world environments.