wangwei/DataCollector
Update 'README-en.md'
Branch: master
Author: JinWang, 4 years ago
Parent: 10a04ddfb9
Commit: 0866c7818b
1 changed file with 3 additions and 3 deletions
README-en.md
@@ -49,13 +49,13 @@ The text is completely repeated within and between different web pages. Based on
## 1. Language filtering, rule cleaning, sensitive word filtering
```
python cc_
cleaner.py
python cc_cleaner.py
```
## 2. Fasttext garbage classification
```
python Filter_
run.py
python Filter_run.py
```
## 3. Deduplication
```
python dedup_
simhash.py
python dedup_simhash.py
```
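
The README names `dedup_simhash.py` for the third step but this diff does not show its contents. As a rough illustration of what SimHash-based near-duplicate removal can look like, here is a minimal sketch. It is not the repository's implementation: the function names, the word-level tokenizer, the MD5-derived 64-bit token hash, and the Hamming-distance threshold of 3 are all assumptions made for the example.

```python
import hashlib
import re

BITS = 64  # fingerprint width (assumed; not taken from the repository)


def simhash(text: str) -> int:
    """Return a 64-bit SimHash fingerprint of `text`."""
    # Word-level tokens are an assumption; Chinese text would more
    # likely need character n-grams or a word segmenter.
    tokens = re.findall(r"\w+", text.lower())
    weights = [0] * BITS
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 8 bytes -> 64-bit hash
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def dedup(docs, threshold: int = 3):
    """Keep the first of each group of near-duplicate documents.

    Quadratic in the number of kept documents; a real pipeline would
    index fingerprints (for example by bit blocks) instead.
    """
    kept, kept_fps = [], []
    for doc in docs:
        fp = simhash(doc)
        if all(hamming_distance(fp, other) > threshold for other in kept_fps):
            kept.append(doc)
            kept_fps.append(fp)
    return kept
```

Calling `dedup(pages)` on a list of page texts returns the list with later near-duplicates dropped. A lower `threshold` keeps more documents, a higher one removes more aggressively.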