Update 'README-en.md'

master
JinWang 4 years ago
commit 9de920f401
1 changed file with 1 addition and 1 deletion
README-en.md

@@ -24,7 +24,7 @@ Common crawl provides a free database containing tens of billions of web page da


- WET: because many tasks only need text information, the common crawl data set provides a wet file containing only the extracted plaintext. The method of storing text data in WET format is very simple. WARC metadata contains various details, including the URL and the length of plaintext data, followed by plaintext data.


# Cleaning and filtering method of raw data based on common crawl wet format
# Cleaning and filtering method of raw data of common crawl WET format


## Classification and filtering of different languages data



