Update 'README-en.md'

4 years ago · 00534fa056
--- a/README-en.md
+++ b/README-en.md
@@ -1,10 +1,10 @@
 # **DataCollector**

 The DataCollector project mainly introduces training dataset resources, data cleaning and filtering methods of NLP pre-training model.

 [[cleaning and filtering methods of web page data](#cleaning and filtering methods of web page data)]

 # Common crawl data format
 <!-- 
 [[cleaning and filtering methods of web page data](#cleaning and filtering methods of web page data)] -->
 <!--  -->
 <!-- # Common crawl data format

 # cleaning and filtering methods of web page data

@@ -13,10 +13,7 @@ The DataCollector project mainly introduces training dataset resources, data cle
 - (3) Garbage data filtering based on classification model
 - (4) Data De duplication method

 # Run



 # Run -->
 # Common crawl data format

 Common crawl provides a free database containing tens of billions of web page data, and hopes that this service can support more research or online services. Common crawl raw data includes three formats: WARC, WAT and WET.