## Introduction
This is a PyTorch implementation of the paper [Hierarchical Attention Networks for Document Classification](https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf).
* The dataset consists of 600k documents extracted from [Yelp 2018](https://www.yelp.com/dataset) customer reviews
* Documents and sentences are tokenized with [NLTK](http://www.nltk.org/) and [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/)
* Supports both CPU and GPU
* The best accuracy is 71%, matching the performance reported in the paper
## Requirements
* Python 3.6
* PyTorch 0.3.0
* numpy
* gensim
* nltk
* Stanford CoreNLP
## Parameters
Based on the paper and my experiments, I set the model parameters as follows:

|Word embedding dimension|GRU hidden size|GRU layers|Word/sentence context vector dimension|
|---|---|---|---|
|200|50|1|100|
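
To show how these dimensions fit together, here is a minimal sketch of a word-level attention encoder. It is not the code from this repository: the class name `WordAttentionEncoder` and the `vocab_size` argument are hypothetical, the GRU is assumed to be bidirectional as in the paper (so 2 x 50 = 100 matches the context vector dimension), and the code targets a recent PyTorch release rather than 0.3.0.

```python
import torch
import torch.nn as nn


class WordAttentionEncoder(nn.Module):
    """Illustrative word encoder: embedding -> bidirectional GRU -> attention."""

    def __init__(self, vocab_size, embed_dim=200, hidden_size=50, context_dim=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Assumption: bidirectional, so each word state has 2 * hidden_size features.
        self.gru = nn.GRU(embed_dim, hidden_size, num_layers=1,
                          bidirectional=True, batch_first=True)
        # Projects each GRU state into the context-vector space.
        self.projection = nn.Linear(2 * hidden_size, context_dim)
        # Learned word context vector; a score is a dot product with it.
        self.context = nn.Linear(context_dim, 1, bias=False)

    def forward(self, word_ids):
        # word_ids: (batch, num_words) tensor of integer word indices
        states, _ = self.gru(self.embedding(word_ids))   # (batch, words, 2 * hidden)
        hidden = torch.tanh(self.projection(states))     # (batch, words, context_dim)
        scores = self.context(hidden).squeeze(-1)        # (batch, words)
        weights = torch.softmax(scores, dim=1)           # attention over words
        # Sentence vector: attention-weighted sum of the GRU states.
        sentence_vec = (weights.unsqueeze(-1) * states).sum(dim=1)
        return sentence_vec, weights


encoder = WordAttentionEncoder(vocab_size=1000)
word_ids = torch.randint(1, 1000, (4, 12))   # 4 sentences of 12 words each
sentence_vec, weights = encoder(word_ids)
```

A sentence-level encoder would have the same shape, taking the sentence vectors as input instead of word embeddings. For brevity this sketch does not mask padded positions before the softmax, which a real implementation would likely need.
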

The training parameters are:

|Epochs|Learning rate|Momentum|Batch size|
|---|---|---|---|
|3|0.01|0.9|64|
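
The momentum value suggests SGD with momentum, although the optimizer is not named here. Under that assumption, the sketch below works through the update rule with the listed learning rate and momentum on plain Python lists. It follows the convention documented for `torch.optim.SGD`; the function name `sgd_momentum_step` is hypothetical.

```python
def sgd_momentum_step(params, grads, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update on plain Python lists.

    Convention (as in torch.optim.SGD):
        v <- momentum * v + g
        p <- p - lr * v
    """
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_params = [p - lr * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


# Example: two steps on f(x) = x**2, whose gradient is 2 * x.
params, velocity = [1.0], [0.0]
for _ in range(2):
    grads = [2 * p for p in params]
    params, velocity = sgd_momentum_step(params, grads, velocity)
```

After the first step the parameter moves from 1.0 to 0.98. In the second step the accumulated velocity makes the move larger than the plain gradient step would be.
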
## Run
1. Prepare the dataset. Download the [data set](https://www.yelp.com/dataset) and unzip the customer reviews into a file. Use preprocess.py to transform the file into a data set for model input.
2. Train the model. The word embeddings for the training data are stored in 'yelp.word2vec'. The model will be trained and automatically saved to 'model.dict'.
```
python train
```
3. Test the model.
```
python evaluate
```
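
Step 1 turns raw reviews into model input. As a rough illustration of what that kind of preprocessing typically involves for a hierarchical model, the sketch below splits a document into sentences and words, maps words to indices, and pads the result to a fixed grid. It is not the logic of preprocess.py: the regex tokenizer is a naive stand-in for NLTK and Stanford CoreNLP, and all function names and the `PAD`/`UNK` indices are hypothetical.

```python
import re

PAD, UNK = 0, 1  # hypothetical reserved indices for padding and unknown words


def split_document(text):
    """Naive stand-in for a real tokenizer: split sentences after . ! ? and lowercase words."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]


def build_vocab(documents):
    """Assign each distinct word an index in order of first appearance."""
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for doc in documents:
        for sentence in doc:
            for word in sentence:
                vocab.setdefault(word, len(vocab))
    return vocab


def encode_document(doc, vocab, max_sentences, max_words):
    """Map words to ids and pad or truncate to a (max_sentences, max_words) grid."""
    grid = [[PAD] * max_words for _ in range(max_sentences)]
    for i, sentence in enumerate(doc[:max_sentences]):
        for j, word in enumerate(sentence[:max_words]):
            grid[i][j] = vocab.get(word, UNK)
    return grid


doc = split_document("Great food. Service was slow!")
vocab = build_vocab([doc])
grid = encode_document(doc, vocab, max_sentences=3, max_words=4)
# grid == [[2, 3, 0, 0], [4, 5, 6, 0], [0, 0, 0, 0]]
```

Each row of the grid is one sentence, so a batch of documents becomes a three-dimensional array of shape (documents, sentences, words) that a word-level encoder followed by a sentence-level encoder can consume.
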