A flexible Python script to crawl posts from DCInside galleries and save the data to an Excel file.
- Crawl Any Gallery: Works with regular, minor (`mgallery`), and mini (`mini`) galleries.
- Page Range Selection: Specify exactly which pages you want to crawl (e.g., pages 1 through 5).
- Keyword Filtering: Filter posts by a specific keyword in the title.
- Data Cleaning: Automatically removes advertisements and other non-post entries to ensure clean data.
- Excel Export: Saves the extracted data (Number, Title, Author, Views, Link, Liked) into a clean `.xlsx` file named after the gallery ID.
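The keyword filtering and data cleaning features above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the row layout (dicts keyed by the exported column names), the helper names `is_real_post` and `filter_posts`, and the rule that advertisements lack a numeric post number are all assumptions.

```python
def is_real_post(row):
    """Return True when the row looks like a normal post."""
    # Assumption: ads and notices carry a non-numeric value (or nothing)
    # in the "Number" column, while real posts carry a plain post number.
    return str(row.get("Number", "")).strip().isdigit()


def filter_posts(rows, keyword=None):
    """Drop non-post rows, then keep only titles containing the keyword."""
    posts = [row for row in rows if is_real_post(row)]
    if keyword:
        posts = [row for row in posts if keyword in row.get("Title", "")]
    return posts


# Hypothetical sample rows for illustration only.
rows = [
    {"Number": "1024", "Title": "녹화 setup question"},
    {"Number": "AD", "Title": "Sponsored"},
    {"Number": "1023", "Title": "Weekly thread"},
]

cleaned = filter_posts(rows)
matched = filter_posts(rows, keyword="녹화")
```

Here `cleaned` keeps the two numbered posts and `matched` keeps only the post whose title contains the keyword.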
- Python 3.x
- The libraries listed in `requirements.txt`.
- Clone the repository or download the files.
- Install the required packages using pip: `pip install -r requirements.txt`
The script is run from the command line with arguments specifying the target gallery and pages.
`python crawler.py -l <URL> -p <PAGE_RANGE> [OPTIONS]`

| Argument | Short Form | Description | Required |
|---|---|---|---|
| `--link` | `-l` | The full URL of the gallery board list. | Yes |
| `--pages` | `-p` | The range of pages to crawl (e.g., "1-5"). | Yes |
| `--search-word` | `-S` | An optional keyword to filter posts by their title. | No |
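One way the arguments in the table could be declared with `argparse` is sketched below. The flag names come from the table; the helper `parse_page_range`, the decision to accept a single page number such as "3", and the validation rules are assumptions rather than the script's confirmed behavior.

```python
import argparse


def parse_page_range(text):
    """Turn "1-5" into range(1, 6); a lone number such as "3" means one page."""
    start_text, sep, end_text = text.partition("-")
    try:
        start = int(start_text)
        # A trailing dash with no end page ("1-") is rejected via int("").
        end = int(end_text) if sep else start
    except ValueError:
        raise argparse.ArgumentTypeError(f"invalid page range: {text!r}")
    if start < 1 or end < start:
        raise argparse.ArgumentTypeError(f"invalid page range: {text!r}")
    return range(start, end + 1)


def build_parser():
    parser = argparse.ArgumentParser(prog="crawler.py")
    parser.add_argument("-l", "--link", required=True,
                        help="full URL of the gallery board list")
    parser.add_argument("-p", "--pages", required=True, type=parse_page_range,
                        help='range of pages to crawl, e.g. "1-5"')
    parser.add_argument("-S", "--search-word", default=None,
                        help="optional keyword to filter posts by title")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["-l", "https://gall.dcinside.com/mgallery/board/lists/?id=record", "-p", "1-3"]
)
pages = list(args.pages)
```

Validating the range inside the `type=` callback lets `argparse` report a malformed value such as "5-1" as a normal usage error before any request is made.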
- Basic Crawling: To crawl pages 1 through 3 of the 'record' minor gallery: `python crawler.py -l "https://gall.dcinside.com/mgallery/board/lists/?id=record" -p 1-3`
- Crawling with a Search Filter: To crawl the first 10 pages of the 'record' gallery and only save posts with the word "녹화" in the title: `python crawler.py -l "https://gall.dcinside.com/mgallery/board/lists/?id=record" -p 1-10 -S "녹화"`
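The example URLs above carry the gallery ID in the `id` query parameter and the gallery type in the path. The sketch below shows how a script might derive the gallery ID, the gallery type, a per-page URL, and the output filename from such a link using only the standard library. The helper names, the `page` query parameter, and the exact `<id>.xlsx` filename are assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def gallery_info(link):
    """Return (gallery_id, kind) for a gallery list URL."""
    parts = urlsplit(link)
    gallery_id = parse_qs(parts.query)["id"][0]
    # Assumption: the gallery type is recognisable from the URL path.
    if "/mgallery/" in parts.path:
        kind = "minor"
    elif "/mini/" in parts.path:
        kind = "mini"
    else:
        kind = "regular"
    return gallery_id, kind


def page_url(link, page):
    """Return the list URL for one page (assumes a 'page' query parameter)."""
    parts = urlsplit(link)
    query = parse_qs(parts.query)
    query["page"] = [str(page)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


link = "https://gall.dcinside.com/mgallery/board/lists/?id=record"
gallery_id, kind = gallery_info(link)
urls = [page_url(link, page) for page in range(1, 4)]
output_name = f"{gallery_id}.xlsx"  # assumed naming: gallery ID plus extension
```

Rebuilding the query with `urlencode` rather than appending `&page=N` by hand keeps the URL valid even when the input link already contains a page number.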
For any questions or feedback, please contact kbs.programmer@gmail.com.