@@ -49,13 +49,13 @@ The text is completely repeated within and between different web pages. Based on
 ## 1. Language filtering, rule cleaning, sensitive word filtering
 ```
-python cc_ cleaner.py
+python cc_cleaner.py
 ```
 ## 2. fastText junk-text classification
 ```
-python Filter_ run.py
+python Filter_run.py
 ```
 ## 3. Deduplication
 ```
-python dedup_ simhash.py
+python dedup_simhash.py
 ```
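
The script name `dedup_simhash.py` suggests that this step relies on SimHash near-duplicate detection. The following is a minimal sketch of that idea, not the contents of the script: the function names `simhash`, `hamming_distance`, and `dedup`, the MD5-based token hashing, the 64-bit fingerprint, and the distance threshold of 3 are all assumptions made for illustration.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Return a SimHash fingerprint of `text` (hypothetical helper).

    `bits` must be a multiple of 8 and at most 128, because the token
    hash is taken from the leading bytes of an MD5 digest.
    """
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def dedup(docs, max_distance: int = 3):
    """Keep a document only if no kept document is within `max_distance` bits.

    The threshold is an assumed value; this linear scan is quadratic
    overall and is meant only to show the principle.
    """
    kept, fingerprints = [], []
    for doc in docs:
        fp = simhash(doc)
        if all(hamming_distance(fp, seen) > max_distance for seen in fingerprints):
            kept.append(doc)
            fingerprints.append(fp)
    return kept
```

Calling `dedup(["same text", "same text"])` keeps a single copy, since identical inputs always produce identical fingerprints. A large corpus would normally need an index over fingerprint segments rather than this pairwise comparison.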