

## Introduction
This is a PyTorch implementation of the paper [Hierarchical Attention Networks for Document Classification](https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf).
* The dataset is 600k documents extracted from [Yelp 2018](https://www.yelp.com/dataset) customer reviews
* [NLTK](http://www.nltk.org/) and [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) are used to tokenize documents and sentences
* Both CPU & GPU are supported
* The best accuracy is 71%, matching the performance reported in the paper
## Requirements
* python 3.6
* pytorch 0.3.0
* numpy
* gensim
* nltk
* CoreNLP
## Parameters
Following the paper and my experiments, I set the model parameters as:

|word embedding dimension|GRU hidden size|GRU layers|word/sentence context vector dimension|
|---|---|---|---|
|200|50|1|100|

And the training parameters:

|Epochs|learning rate|momentum|batch size|
|---|---|---|---|
|3|0.01|0.9|64|
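The word-level attention step that these dimensions feed can be sketched in plain numpy. This is a hedged illustration, not the repo's code: the weight names (`W_w`, `b_w`, `u_w`) follow the paper's notation, the values are random placeholders, and the 100-dim annotations come from the bidirectional GRU (2 × hidden size 50).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # Numerically stable softmax over a 1-D array
    e = np.exp(x - x.max())
    return e / e.sum()

def word_attention(h, W_w, b_w, u_w):
    """Collapse word annotations h (T x 100) into one sentence vector."""
    u = np.tanh(h @ W_w + b_w)   # u_it = tanh(W_w h_it + b_w)
    alpha = softmax(u @ u_w)     # importance weight of each word
    return alpha @ h, alpha      # s_i = sum_t alpha_it * h_it

T, d = 7, 100                    # 7 words, 100-dim GRU annotations
h = rng.standard_normal((T, d))  # placeholder word annotations
W_w = rng.standard_normal((d, d)) * 0.1
b_w = np.zeros(d)
u_w = rng.standard_normal(d)     # word context vector (dim 100)

s, alpha = word_attention(h, W_w, b_w, u_w)
print(s.shape)                   # sentence vector: (100,)
```

The same formula, applied over sentence vectors with a sentence context vector, yields the document vector that the final classifier consumes.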
## Run
1. Prepare the dataset. Download the [dataset](https://www.yelp.com/dataset), unzip it, and extract the customer reviews into a file. Use preprocess.py to transform the file into a dataset for model input.
2. Train the model. Word embeddings of the training data are stored in 'yelp.word2vec'. The model will be trained and automatically saved to 'model.dict':
```
python train
```
3. Test the model:
```
python evaluate
```
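The repo tokenizes with NLTK and Stanford CoreNLP; the regex stand-in below is only a hedged sketch of the document → sentences → words hierarchy that preprocess.py must produce for the model (the function name `tokenize_document` is illustrative, not from the repo):

```python
import re

def tokenize_document(text):
    # Split into sentences at terminal punctuation, then each
    # sentence into word tokens. A real run should use NLTK /
    # CoreNLP instead of this simplistic regex.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [re.findall(r"[A-Za-z']+|\d+", s) for s in sentences if s]

review = "The food was great. Service was slow, though!"
print(tokenize_document(review))
# [['The', 'food', 'was', 'great'], ['Service', 'was', 'slow', 'though']]
```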
