| @@ -1,10 +1,10 @@ | |||||
| # **DataCollector** | # **DataCollector** | ||||
| The DataCollector project mainly introduces training dataset resources, data cleaning, and filtering methods for NLP pre-training models. | The DataCollector project mainly introduces training dataset resources, data cleaning, and filtering methods for NLP pre-training models. | ||||
| [[cleaning and filtering methods of web page data](#cleaning and filtering methods of web page data)] | |||||
| # Common crawl data format | |||||
| <!-- | |||||
| [[cleaning and filtering methods of web page data](#cleaning and filtering methods of web page data)] --> | |||||
| <!-- --> | |||||
| <!-- # Common crawl data format | |||||
| # cleaning and filtering methods of web page data | # cleaning and filtering methods of web page data | ||||
| @@ -13,10 +13,7 @@ The DataCollector project mainly introduces training dataset resources, data cle | |||||
| - (3) Garbage data filtering based on classification model | - (3) Garbage data filtering based on classification model | ||||
| - (4) Data deduplication method | - (4) Data deduplication method | ||||
| # Run | |||||
| # Run --> | |||||
| # Common crawl data format | # Common crawl data format | ||||
| Common Crawl provides a free dataset containing tens of billions of web pages, with the aim of supporting research and online services. Common Crawl raw data comes in three formats: WARC, WAT and WET. | Common Crawl provides a free dataset containing tens of billions of web pages, with the aim of supporting research and online services. Common Crawl raw data comes in three formats: WARC, WAT and WET. | ||||
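
As a rough illustration of the formats named above, here is a minimal sketch of reading WARC-style records, which is also the record layout WET files use: a `WARC/1.0` version line, header lines, a blank line, then `Content-Length` bytes of payload. The function name `iter_warc_records` and the `SAMPLE` record are hypothetical and are not part of DataCollector; the sketch assumes an already decompressed byte stream and skips the validation a real parser would need.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, body) pairs from an uncompressed WARC/WET byte stream.

    Hypothetical helper for illustration only, not DataCollector code.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # skip blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The payload is exactly Content-Length bytes long.
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body


# A made-up WET-style record: extracted plain text of one page.
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: conversion\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 11\r\n"
    b"\r\n"
    b"hello world"
    b"\r\n\r\n"
)

records = list(iter_warc_records(io.BytesIO(SAMPLE)))
for headers, body in records:
    print(headers["WARC-Target-URI"], body.decode("utf-8"))
```

Real Common Crawl files are gzip-compressed, so in practice the stream would first be opened with something like `gzip.open(path, "rb")`.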