Update 'README-en.md'

master
JinWang 4 years ago
parent
commit
00534fa056
1 changed file with 5 additions and 8 deletions
README-en.md

@@ -1,10 +1,10 @@
# **DataCollector**

The DataCollector project mainly introduces training dataset resources, data cleaning and filtering methods of NLP pre-training model.
[[cleaning and filtering methods of web page data](#cleaning and filtering methods of web page data)]
# Common crawl data format
<!--
[[cleaning and filtering methods of web page data](#cleaning and filtering methods of web page data)] -->
<!-- -->
<!-- # Common crawl data format

# cleaning and filtering methods of web page data

@@ -13,10 +13,7 @@ The DataCollector project mainly introduces training dataset resources, data cle
- (3) Garbage data filtering based on classification model
- (4) Data De duplication method

# Run



# Run -->
# Common crawl data format

Common crawl provides a free database containing tens of billions of web page data, and hopes that this service can support more research or online services. Common crawl raw data includes three formats: WARC, WAT and WET.

