From 00534fa056e69ffaa4a0c302d3031199812047d5 Mon Sep 17 00:00:00 2001
From: JinWang
Date: Mon, 1 Nov 2021 10:46:42 +0800
Subject: [PATCH] Update 'README-en.md'
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 README-en.md | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/README-en.md b/README-en.md
index 0cfe0b20..9e1ff748 100644
--- a/README-en.md
+++ b/README-en.md
@@ -1,10 +1,10 @@
 # **DataCollector**
 The DataCollector project mainly introduces training dataset resources, data cleaning and filtering methods of NLP pre-training model.
-
- [[cleaning and filtering methods of web page data](#cleaning and filtering methods of web page data)]
-
-# Common crawl data format
+
+
+ # Common crawl data format
 Common crawl provides a free database containing tens of billions of web page data, and hopes that this service can support more research or online services. Common crawl raw data includes three formats: WARC, WAT and WET.