
Update 'README-en.md'

master
JinWang 4 years ago
parent
commit
00534fa056
1 changed file with 5 additions and 8 deletions
README-en.md

@@ -1,10 +1,10 @@
 # **DataCollector**

 The DataCollector project mainly introduces training dataset resources, data cleaning and filtering methods of NLP pre-training model.
-[[cleaning and filtering methods of web page data](#cleaning and filtering methods of web page data)]
-# Common crawl data format
+<!--
+[[cleaning and filtering methods of web page data](#cleaning and filtering methods of web page data)] -->
+<!-- -->
+<!-- # Common crawl data format

 # cleaning and filtering methods of web page data


@@ -13,10 +13,7 @@ The DataCollector project mainly introduces training dataset resources, data cle
 - (3) Garbage data filtering based on classification model
 - (4) Data De duplication method

-# Run
+# Run -->
 # Common crawl data format

 Common crawl provides a free database containing tens of billions of web page data, and hopes that this service can support more research or online services. Common crawl raw data includes three formats: WARC, WAT and WET.
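
As a rough illustration of the three formats named above: WARC files hold the raw crawl records, WAT files hold metadata derived from them (as JSON), and WET files hold extracted plain text. All three use the same record layout: a `WARC/x.y` version line, `Name: value` headers, a blank line, then a payload of `Content-Length` bytes. The sketch below is not part of the DataCollector code; `read_warc_records` and `make_record` are hypothetical helper names, and it assumes an uncompressed byte stream. Real Common Crawl files are gzip-compressed and are usually read with a dedicated library such as `warcio`.

```python
import io


def read_warc_records(stream):
    """Yield (headers, payload) pairs from an uncompressed WARC-style byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        version = line.strip().decode("utf-8")
        if not version.startswith("WARC/"):
            raise ValueError(f"expected a WARC version line, got {version!r}")

        # Header block: 'Name: value' lines terminated by an empty line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The payload length is given in bytes by Content-Length.
        length = int(headers.get("Content-Length", 0))
        payload = stream.read(length)
        yield headers, payload


def make_record(warc_type, target_uri, text):
    """Build one synthetic record (illustration only, not real crawl data)."""
    body = text.encode("utf-8")
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {warc_type}\r\n"
        f"WARC-Target-URI: {target_uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


# WET files store extracted text in records of type 'conversion'.
sample = (
    make_record("conversion", "http://example.com/a", "Hello plain text")
    + make_record("conversion", "http://example.com/b", "Second page\n\nwith a blank line")
)
records = list(read_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["WARC-Target-URI"], len(payload))
```

Reading the payload by `Content-Length` rather than by scanning for a blank line matters because page text can itself contain blank lines, as the second synthetic record does.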

