NOTE: data must be generated with bigcode-ast-tools before this tool can be
used.
bigcode-embeddings generates and visualizes embeddings for
AST nodes.
This project should be used with Python 3.
To install the package, either run
pip install bigcode-embeddings
or clone the repository and run
cd bigcode-embeddings
pip install -r requirements.txt
python setup.py install
NOTE: tensorflow needs to be installed separately.
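Since tensorflow is not pulled in by the steps above, it can help to check that it is importable before training. The snippet below is a minimal sketch using only the Python standard library. The helper name tensorflow_available is hypothetical and not part of bigcode-embeddings, and the snippet does not check which tensorflow version the project expects.

```python
import importlib.util


def tensorflow_available():
    """Return True if a top-level `tensorflow` package can be found.

    Hypothetical helper for illustration; not part of bigcode-embeddings.
    Only checks that the package is discoverable, not its version.
    """
    return importlib.util.find_spec("tensorflow") is not None


if __name__ == "__main__":
    if tensorflow_available():
        print("tensorflow found")
    else:
        print("tensorflow not found; it needs to be installed separately")
```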
Training data can be generated using bigcode-ast-tools.
Given a data.txt.gz generated from a vocabulary of size 30000,
100-dimensional embeddings can be trained using
./bin/bigcode-embeddings train -o embeddings/ --vocab-size 30000 --emb-size 100 --l2-value 0.05 --learning-rate 0.01 data.txt.gz
TensorBoard can be used to monitor training progress
tensorboard --logdir embeddings/
After the first epoch, the embedding visualization becomes available in
TensorBoard. The vocabulary TSV file generated by bigcode-ast-tools can
be loaded to label the embeddings.
Trained embeddings can be visualized using the visualize subcommand.
If the generated vocabulary file is vocab.tsv, the above embeddings
can be visualized with the following command
./bin/data-explorer visualize clusters -m embeddings/embeddings.bin-STEP -l vocab.tsv
where STEP should be the largest value found in the embeddings/ directory.
The -i flag can be passed to generate an interactive plot.
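Picking the largest STEP by hand is error-prone once the embeddings/ directory holds several checkpoints. The following is a rough sketch of how it could be looked up. It assumes checkpoint files are named embeddings.bin-STEP, optionally followed by a dotted suffix; the suffix handling and the helper name latest_step are assumptions, not something bigcode-embeddings provides.

```python
import re
from pathlib import Path

# Assumed naming scheme: "embeddings.bin-STEP", optionally followed by a
# dotted suffix (for example ".index" or ".meta"). Adjust if your files differ.
CHECKPOINT_PATTERN = re.compile(r"^embeddings\.bin-(\d+)(?:\..*)?$")


def latest_step(directory):
    """Return the largest STEP found in `directory`, or None if there is none.

    Hypothetical helper for illustration; not part of bigcode-embeddings.
    """
    root = Path(directory)
    if not root.is_dir():
        return None
    steps = []
    for path in root.iterdir():
        match = CHECKPOINT_PATTERN.match(path.name)
        if match:
            # Compare as integers so that 1000 ranks above 900.
            steps.append(int(match.group(1)))
    return max(steps) if steps else None


if __name__ == "__main__":
    step = latest_step("embeddings")
    if step is None:
        print("no embeddings.bin-STEP files found in embeddings/")
    else:
        print("embeddings/embeddings.bin-{}".format(step))
```

The printed path can then be passed to the -m option. Steps are compared as integers rather than strings, since a plain alphabetical sort would place 900 after 1000.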