The DataCollector project introduces training dataset resources for NLP pre-trained models, together with methods for cleaning and filtering that data.
[Cleaning and filtering methods of web page data](#cleaning-and-filtering-methods-of-web-page-data)
# Common Crawl data format
Common Crawl provides a free corpus containing tens of billions of web pages, with the aim of supporting research and online services. Its raw data is distributed in three formats: WARC, WAT and WET.
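
To make the record layout concrete, here is a minimal sketch of reading WARC-style records with only the Python standard library. It assumes an uncompressed stream in which each record is a `WARC/1.0` version line, `Name: value` headers, a blank line, and then `Content-Length` bytes of payload. The names `iter_warc_records` and `make_record` are hypothetical helpers written for this illustration and are not part of DataCollector; the sample records are synthetic.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC/WET stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")

        # Read "Name: value" header lines until the blank line that ends them.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The payload is exactly Content-Length bytes long.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def make_record(uri, body):
    """Build one synthetic WET-style record (illustration only)."""
    return (
        b"WARC/1.0\r\n"
        b"WARC-Type: conversion\r\n"
        b"WARC-Target-URI: " + uri.encode("utf-8") + b"\r\n"
        b"Content-Type: text/plain\r\n"
        b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
        b"\r\n" + body + b"\r\n\r\n"
    )


sample = make_record("http://example.com/a", b"Hello, world") + make_record(
    "http://example.com/b", b"Second page text"
)
records = list(iter_warc_records(io.BytesIO(sample)))

for headers, payload in records:
    print(headers["WARC-Target-URI"], len(payload))
```

Files downloaded from Common Crawl are gzip-compressed, so in practice the stream would typically come from something like `gzip.open(path, "rb")` rather than `io.BytesIO`. For production use, a dedicated WARC parsing library is likely a better choice than this hand-rolled reader.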