|
|
|
@@ -24,7 +24,7 @@ Common crawl provides a free database containing tens of billions of web page da |
|
|
|
|
|
|
|
- WET: because many tasks only need text information, the Common Crawl dataset provides WET files containing only the extracted plaintext. The WET format stores text very simply: each record begins with WARC metadata giving details such as the URL and the length of the plaintext, followed by the plaintext itself. |
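The record layout described above (metadata headers, then plaintext of the stated length) can be illustrated with a minimal sketch. It assumes an uncompressed byte stream and the standard WARC header names `WARC-Target-URI` and `Content-Length`; the function name `read_wet_records` and the sample record are hypothetical, made up for illustration.

```python
import io


def read_wet_records(stream):
    """Yield (headers, text) pairs from an uncompressed WET byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # skip blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")
        # Header block: "Name: value" lines terminated by a blank line.
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break
            key, _, value = line.decode("utf-8").partition(":")
            headers[key.strip()] = value.strip()
        # Content-Length gives the plaintext size in bytes.
        length = int(headers["Content-Length"])
        text = stream.read(length).decode("utf-8", errors="replace")
        yield headers, text


# Hypothetical sample record, built by hand for illustration only.
body = "Hello, Common Crawl!\n".encode("utf-8")
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: conversion\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: " + str(len(body)).encode() + b"\r\n"
    b"\r\n" + body + b"\r\n\r\n"
)

records = list(read_wet_records(io.BytesIO(sample)))
for headers, text in records:
    print(headers["WARC-Target-URI"], len(text))
```

Published WET files are gzip-compressed, so in practice the stream would likely come from something like `gzip.open(path, "rb")` rather than an in-memory buffer; a dedicated WARC parsing library may be a more robust choice than this hand-rolled reader.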
|
|
|
|
|
|
|
# Cleaning and filtering method of raw data based on common crawl wet format |
|
|
|
# Cleaning and filtering method of raw data of common crawl WET format |
|
|
|
|
|
|
|
## Classification and filtering of different languages data |
|
|
|
|
|
|
|
|