diff --git a/README-en.md b/README-en.md
index 0cfe0b20..9e1ff748 100644
--- a/README-en.md
+++ b/README-en.md
@@ -1,10 +1,10 @@
 # **DataCollector**
 The DataCollector project mainly introduces training dataset resources, data cleaning and filtering methods of NLP pre-training model.
-
- [[cleaning and filtering methods of web page data](#cleaning and filtering methods of web page data)]
-
-# Common crawl data format
+
+
+ # Common crawl data format
 Common crawl provides a free database containing tens of billions of web page data, and hopes that this service can support more research or online services. Common crawl raw data includes three formats: WARC, WAT and WET.