# **DataCollector**

The DataCollector project introduces training dataset resources and the data cleaning and filtering methods used to build pre-training corpora for NLP models.
# Common Crawl data format

Common Crawl provides a free archive containing tens of billions of web pages, intended to support research and online services. The raw data is distributed in three formats: WARC, WAT and WET.

- WARC: the WARC format holds the raw crawl data and maps directly onto the crawl process. It stores not only the HTTP response from each website (WARC-Type: response), but also information about how the request was made (WARC-Type: request) and metadata about the crawl process itself (WARC-Type: metadata).

- WAT: WAT files store, in JSON format, important metadata about the records in the WARC files above. This metadata is computed for each of the three record types (metadata, request, and response). For HTML responses, it includes the returned HTTP headers and the links listed on the page (including each link's type).

- WET: since many tasks need only the text, Common Crawl also provides WET files containing just the extracted plaintext. The WET format is very simple: each record consists of WARC-style metadata (including the URL and the length of the plaintext) followed by the plaintext itself.
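
To illustrate this layout, here is a minimal sketch of a WET record parser. The record-boundary and header handling are simplified assumptions for readability; a real pipeline should use a proper WARC parsing library.

```python
# Minimal sketch of a WET record parser (assumed record layout:
# a "WARC/1.0" version line, header lines, a blank line, then plaintext).
def parse_wet(text):
    """Split raw WET content into (headers, plaintext) records."""
    records = []
    # Records in a WET file are introduced by "WARC/1.0" version lines.
    for chunk in text.split("WARC/1.0")[1:]:
        head, _, body = chunk.partition("\n\n")
        headers = {}
        for line in head.strip().splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                headers[key.strip()] = value.strip()
        records.append((headers, body.strip()))
    return records
```

Given a record whose headers include `WARC-Target-URI`, the parser returns that URI alongside the extracted plaintext body.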

# Cleaning and filtering methods for Common Crawl WET raw data

## Classification and filtering of data in different languages

Web pages in a target language can be filtered quickly by checking the Unicode code-point ranges characteristic of that language. In practice, the encoding of the raw data, the page structure and other factors also need to be considered, and black/white lists of web pages can be built to improve the efficiency and quality of extraction. A text classification model can also be used for language filtering.
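
A minimal sketch of the Unicode-range idea, assuming Chinese as the target language (the CJK block and the 0.5 threshold here are illustrative assumptions, not the project's actual settings):

```python
# Sketch of Unicode-range language filtering.
def cjk_ratio(text):
    """Fraction of characters in the CJK Unified Ideographs block."""
    if not text:
        return 0.0
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return cjk / len(text)

def is_chinese_page(text, threshold=0.5):
    # Keep pages whose text is mostly CJK characters.
    return cjk_ratio(text) >= threshold
```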

## Rule-based filtering

Common Crawl data contains many kinds of noise, such as special symbols, advertisements and web page titles. Building cleaning rules that target these characteristics of the data improves its quality.
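
A sketch of what such rules can look like (the patterns below are illustrative assumptions, not the project's actual rule set):

```python
import re

# Illustrative cleaning rules; real rule sets are tuned to the corpus.
RULES = [
    (re.compile(r"https?://\S+"), " "),                 # strip bare URLs
    (re.compile(r"[\u0000-\u0008\u000b-\u001f]"), ""),  # control characters
    (re.compile(r"\s{2,}"), " "),                       # collapse whitespace
]

def apply_rules(text):
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    return text.strip()
```

Rules are applied in order, so the whitespace-collapsing rule also cleans up gaps left by the earlier substitutions.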

## Garbage data filtering based on a classification model

The text produced by the two steps above often still contains a large amount of sensitive, pornographic, advertising and similar content. We filter such text using a fastText-based text classification model together with keyword matching.

## Large-scale data deduplication

Text is often duplicated both within and across web pages. On a Hadoop/Spark platform, we use the HashingTF + MinHashLSH algorithms to deduplicate the data at paragraph granularity. We are currently writing up this work for a patent; once the patent is filed, we will open-source the Spark-based global deduplication code. For now, only the single-machine deduplication code is open source in this repository.
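
The MinHash idea behind MinHashLSH can be illustrated on a single machine. This is a toy sketch, not the Spark pipeline; the shingle size and signature length are arbitrary choices.

```python
import hashlib

# Toy single-machine MinHash sketch of near-duplicate detection
# (the production pipeline uses Spark's HashingTF + MinHashLSH).
def shingles(text, n=3):
    """Set of character n-grams of the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(text, num_hashes=64):
    """MinHash signature: per seed, the smallest hash over all shingles."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)
        ))
    return sig

def est_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Paragraphs whose estimated Jaccard similarity exceeds a chosen threshold are treated as duplicates; LSH banding makes this comparison scalable by only comparing candidate pairs.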

# Run

## 1. Language filtering, rule cleaning, sensitive word filtering
```
python cc_cleaner.py
```
## 2. fastText garbage classification
```
python Filter_run.py
```
## 3. Deduplication
```
python dedup_simhash.py
```