bigcode-fetcher

A utility to search and fetch code from GitHub. This tool was built to make it easy to create datasets for repository analysis.

The tool works in two phases: search finds repositories using the GitHub API and saves the result in a JSON file; download fetches all the repositories listed in that JSON file.

Install

This tool can be installed by running

pip install bigcode-fetcher

or by fetching this repository and running

pip install .

in this directory.

Usage

`search` command

By default, the utility searches for repositories fulfilling the following conditions:

- size between 1M and 100M
- stars count > 10
- non-viral license (MIT, Apache-2.0, MPL-2.0, BSD-2-Clause, BSD-3-Clause, BSD-4-Clause, MS-PL)

and retrieves the first 100 projects, ordered by number of stars.
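As a rough illustration of how these defaults could map onto a GitHub repository search, here is a minimal sketch. It is not the tool's actual implementation: the helper names `build_search_query` and `build_search_params` are hypothetical, the size bounds assume that "1M" means 1000 KB (GitHub's `size:` qualifier is expressed in kilobytes), and only a single license is passed per query to keep the example simple.

```python
def build_search_query(min_size_kb=1000, max_size_kb=100000, min_stars=10,
                       license_key="mit", language=None, user=None, keyword=None):
    """Assemble a GitHub search query string (hypothetical helper)."""
    parts = []
    if keyword:
        parts.append(keyword)
    # Assumption: 1M..100M translated to kilobytes for the size: qualifier.
    parts.append(f"size:{min_size_kb}..{max_size_kb}")
    parts.append(f"stars:>{min_stars}")
    parts.append(f"license:{license_key}")
    if language:
        parts.append(f"language:{language}")
    if user:
        parts.append(f"user:{user}")
    return " ".join(parts)


def build_search_params(query, per_page=100):
    """Query parameters for the repository search endpoint, most-starred first."""
    return {"q": query, "sort": "stars", "order": "desc", "per_page": per_page}


if __name__ == "__main__":
    query = build_search_query(language="Java", user="apache", keyword="commons")
    print(build_search_params(query))
```

Keeping the query as a plain string makes it easy to print and paste into the GitHub search box when checking why a repository was or was not matched.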

To avoid API rate limiting, an access token can be provided either with the --token CLI argument or with the GITHUB_TOKEN environment variable.
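The token lookup described above could look roughly like the following sketch. The function names `resolve_token` and `auth_headers` are hypothetical, and the precedence (CLI argument first, then `GITHUB_TOKEN`) is an assumption rather than documented behavior of the tool.

```python
import os


def resolve_token(cli_token=None, environ=None):
    """Return the token to use, or None for unauthenticated requests.

    Assumption: a token given on the command line wins over GITHUB_TOKEN.
    """
    environ = os.environ if environ is None else environ
    return cli_token or environ.get("GITHUB_TOKEN") or None


def auth_headers(token):
    """Build request headers; the Authorization header is added only with a token."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


if __name__ == "__main__":
    token = resolve_token()
    print("authenticated" if token else "anonymous", sorted(auth_headers(token)))
```

Passing the environment in as a parameter keeps the lookup easy to test without touching the real process environment.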

See the help for all the options:

bigcode-fetcher search -h

Example

Search for all Apache Commons projects written in Java:

mkdir -p apache-common-projects
bigcode-fetcher search --language Java --user apache --stars '>0' --keyword commons --max-repos 500 -o apache-common-projects/apache-commons.json

`download` command

This command simply runs git clone on every repository listed in the JSON file generated by the search command.

To reduce the download size, only the latest revision is fetched by default (i.e. git clone --depth 1). This can be disabled by passing in the --full flag.

USERNAME/REPO will be fetched in OUTPUT_DIR/USERNAME/REPO, where OUTPUT_DIR is set by the --output option.

The command skips a project whose directory already exists, so running it multiple times is safe, and doing so is recommended to make sure all repositories have been fetched.
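The download rules above (shallow clone by default, `OUTPUT_DIR/USERNAME/REPO` layout, skipping existing directories) can be sketched as follows. This is an illustration under stated assumptions, not the tool's code: the helper names are hypothetical, the HTTPS clone URL is assumed, and the input is taken to be a plain list of `USERNAME/REPO` strings because the layout of the JSON file is not described here.

```python
import os


def target_dir(output_dir, full_name):
    """Map USERNAME/REPO to OUTPUT_DIR/USERNAME/REPO."""
    username, repo = full_name.split("/", 1)
    return os.path.join(output_dir, username, repo)


def clone_command(full_name, output_dir, full=False):
    """Build the git clone argument list; shallow unless full=True."""
    cmd = ["git", "clone"]
    if not full:
        cmd += ["--depth", "1"]
    # Assumption: repositories are cloned over HTTPS from github.com.
    cmd += [f"https://github.com/{full_name}.git", target_dir(output_dir, full_name)]
    return cmd


def plan_downloads(full_names, output_dir, full=False):
    """Return clone commands only for repositories not already on disk."""
    return [
        clone_command(name, output_dir, full)
        for name in full_names
        if not os.path.isdir(target_dir(output_dir, name))
    ]


if __name__ == "__main__":
    for cmd in plan_downloads(["apache/commons-io"], "repositories"):
        print(" ".join(cmd))
```

Each planned command could then be handed to `subprocess.run(cmd, check=False)`; because existing directories are filtered out first, a second run only retries the repositories that are still missing.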

See the help for more information:

bigcode-fetcher download -h

Example

Download all the Apache Commons projects found by the search above:

mkdir -p apache-common-projects/repositories
bigcode-fetcher download -i apache-common-projects/apache-commons.json -o apache-common-projects/repositories
