This folder contains the data files used in this project:
- SimpleNLG sampled files
- HTML rendered version of the sampled files
- Human-corrected
- GPT4 generated
- JSONL for fine-tuning
- Prompts for inference
The SimpleNLG sampled files were generated using AAC SPEECH STANDALONE; see the repo for details.
The HTML rendered version was generated using the following Perl one-liner:
cat 2000.out |perl -ne 'if(m/IN/){$i++; print "<li> ($i)";@a=split(/\{/,$_); shift@a;@a=map{s/.*https/https/;s/\s*\}\s*//;$_}@a;foreach(@a){print "<img width=64 height=64 src=\"$_\"/>"}print"<br/>\n"};if(m/OUT/){s/OUT\s*\d+: //;print"<p>";print "($i) "; print;print"</p>";print"</li>"}' > 2000.html
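For readers who find the one-liner above hard to follow, here is a rough Python sketch of the same logic. It is not the script that produced 2000.html. It assumes, based only on the regular expressions in the one-liner, that 2000.out alternates lines of the form "IN n: {word https://...} {word https://...}" and "OUT n: text". The function name render_html is made up for illustration.

```python
import re

def render_html(lines):
    """Turn IN/OUT lines into <li> items with one <img> per icon."""
    html = []
    n = 0  # running utterance number, incremented on every IN line
    for line in lines:
        if "IN" in line:
            n += 1
            html.append(f"<li> ({n})")
            # The text before the first "{" is the "IN n:" prefix; skip it.
            for chunk in line.split("{")[1:]:
                # Keep only the URL: drop the leading word and the closing brace.
                url = re.sub(r".*https", "https", chunk, count=1)
                url = re.sub(r"\s*\}\s*", "", url, count=1)
                html.append(f'<img width=64 height=64 src="{url}"/>')
            html.append("<br/>\n")
        if "OUT" in line:
            # Like the one-liner, the text keeps its trailing newline.
            text = re.sub(r"OUT\s*\d+: ", "", line, count=1)
            html.append(f"<p>({n}) {text}</p></li>")
    return "".join(html)
```

Like the one-liner, this matches "IN" and "OUT" anywhere in the line, so it relies on the generated text not containing those strings.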
The LibreOffice files were produced by opening the HTML files in Mozilla Firefox, selecting all, copying, and pasting into a blank LibreOffice document. This took a lot of RAM and time to process.
The texts themselves were corrected by hand, including deleting sequences of icons that were nonsensical or included unsavory themes. A total of 504 utterances are available for training.
The annotation1.odt file contains the 2,000 entries; annotation_final.odt contains only the 504 hand-corrected ones.
The .xhtml file was obtained by saving the ODT file as XHTML. It is used to produce the JSONL files used for fine-tuning and the TXT file for GPT4.
The 504 annotated utterances were sent through GPT4 with the prompt:
Rewrite the following outputs for an AAC Communicator to
make them warmer , more familiar and the type of things a
4 year old would say:
Given the 8k-token limit, this required 5 calls to GPT4.
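As an illustration of how 504 utterances can be spread over 5 calls, here is a minimal Python sketch. It assumes equal-sized batches with the prompt placed before each batch. How the original requests were actually split is not recorded here, and a real split would depend on token counts rather than line counts. The names PROMPT and batch_requests are made up for illustration.

```python
# The prompt quoted above, joined onto one line.
PROMPT = (
    "Rewrite the following outputs for an AAC Communicator to "
    "make them warmer , more familiar and the type of things a "
    "4 year old would say:"
)

def batch_requests(utterances, calls=5):
    """Split the utterances into `calls` roughly equal request bodies."""
    # Ceiling division, with a floor of 1 so an empty list is handled.
    size = max(1, -(-len(utterances) // calls))
    return [
        PROMPT + "\n" + "\n".join(utterances[i:i + size])
        for i in range(0, len(utterances), size)
    ]
```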
The texts were obtained by using the following Bash one-liner:
cat annotation_final.xhtml |grep 'Text_20_body'|grep '<p'|perl -pe 's/.*Text_20_body\"\>//'|perl -ne 's/\<span[^>]+>//g;s/\<\/span\>//g;s/\.?\s*\<\/p.*//;print if m/\(\d+\)/' > annotation-504.txt
The instructions to fine-tune using the base texts were obtained with the following one-liner:
cat 2000.out |perl -e 'open(A, "annotation-504.txt"); @a=<A>;chomp(@a);%a=map { ($d,$t) = m/\((\d+)\) (.*)/; $d=>$t } @a; while(<STDIN>){if(m/IN/){$i++; if($a{$i}){chomp; s/^IN \d+\: //;s-https://textualization.com/acc_icons/--g; print "$_\t".$a{$i}."\n"}}}'|perl -ne 'use JSON; chomp; ($i,$o)=split(/\t/); $i=~s/\{/\n* \{/g;$j={ 'text'=> "<human>: Simulate an AAC communicator given the following icon input: $i\n<bot>: $o\n"};print encode_json $j;print"\n"' > aac-504.jsonl
and
cat 2000.out |perl -e 'open(A, "annotation-toddler-504.txt"); @a=<A>;chomp(@a);%a=map { ($d,$t) = m/\((\d+)\) (.*)/; $d=>$t } @a; while(<STDIN>){if(m/IN/){$i++; if($a{$i}){chomp; s/^IN \d+\: //;s-https://textualization.com/acc_icons/--g; print "$_\t".$a{$i}."\n"}}}'|perl -ne 'use JSON; chomp; ($i,$o)=split(/\t/); $i=~s/\{/\n* \{/g;$j={ 'text'=> "<human>: Simulate an AAC communicator given the following icon input: $i\n<bot>: $o\n"};print encode_json $j;print"\n"' > OpenChatKit/data/aac-toddler-504.jsonl
for the GPT4 ones.
These files should go into OpenChatKit/data
(Might need to install the package libjson-perl to regenerate it.)
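The two one-liners above do the same join: they number the IN lines of 2000.out, keep those whose number appears in the annotation file, and wrap icons and text in a "<human>: ... <bot>: ..." record. A rough Python sketch of that logic follows. It is not the script that produced the files, and the 2000.out line format is assumed from the regular expressions. The function names are made up for illustration.

```python
import re

ICON_BASE = "https://textualization.com/acc_icons/"
TASK = "Simulate an AAC communicator given the following icon input: "

def load_annotations(lines):
    """Map utterance number -> text from lines like '(12) I want juice'."""
    annotations = {}
    for line in lines:
        m = re.search(r"\((\d+)\) (.*)", line.rstrip("\n"))
        if m:
            annotations[int(m.group(1))] = m.group(2)
    return annotations

def training_records(out_lines, annotations):
    """Yield one {'text': ...} record per annotated IN line."""
    n = 0  # IN lines are numbered by order of appearance
    for line in out_lines:
        if "IN" not in line:
            continue
        n += 1
        if n not in annotations:
            continue  # utterance was dropped during hand correction
        icons = re.sub(r"^IN \d+: ", "", line.rstrip("\n"))
        icons = icons.replace(ICON_BASE, "")   # keep only the icon file names
        icons = icons.replace("{", "\n* {")    # one bullet per icon
        yield {"text": f"<human>: {TASK}{icons}\n<bot>: {annotations[n]}\n"}
```

Writing each record with json.dumps, one per line, gives a JSONL file equivalent to the one produced by encode_json.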
For running inference, the full set of prompt inputs is useful:
cat 2000.out |perl -ne 'use JSON; if(m/IN/){chomp; s/^IN \d+\: //;s-https://textualization.com/acc_icons/--g; s/\{/\n* \{/g;$j={ 'text'=> "<human>: Simulate an AAC communicator given the following icon input: $_\n<bot>:"};print encode_json $j;print"\n"}' > 2000.prompts
(Might need to install the package libjson-perl to regenerate it.)
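Each line of 2000.prompts is a JSON object whose "text" field ends with "<bot>:", so the model is expected to continue from there. Below is a minimal Python sketch of reading the prompts and trimming a completion down to the reply. It assumes the model may echo the prompt and may start a new "<human>:" turn, which depends on the inference setup. The helper names are made up for illustration.

```python
import json

def load_prompts(lines):
    """Return the prompt string from each non-empty JSONL line."""
    return [json.loads(line)["text"] for line in lines if line.strip()]

def reply_only(prompt, generated):
    """Keep only the first bot turn of a model completion."""
    # Assumption: the model may echo the prompt before its answer.
    if generated.startswith(prompt):
        generated = generated[len(prompt):]
    # Assumption: a new "<human>:" marker means the reply has ended.
    return generated.split("<human>:")[0].strip()
```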
The file data-504-fields.jsonl contains input and output without any system prompt or task prompt.
It was obtained with:
cat 2000.out |perl -e 'open(A, "annotation-504.txt"); @a=<A>;chomp(@a);%a=map { ($d,$t) = m/\((\d+)\) (.*)/; $d=>$t } @a; while(<STDIN>){if(m/IN/){$i++; if($a{$i}){chomp; s/^IN \d+\: //;s-https://textualization.com/acc_icons/--g; print "$_\t".$a{$i}."\n"}}}'|perl -ne 'use JSON; chomp; ($i,$o)=split(/\t/); $i=~s/\{/\n* \{/g;$j={ 'input'=> $i, 'output'=> $o};print encode_json $j;print"\n"' > aac-504-fields.jsonl
(Might need to install the package libjson-perl to regenerate it.)
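Since the fields file carries no prompt, a different template can be applied at load time. The Python sketch below wraps one record in the same "<human>: ... <bot>: ..." layout used by the fine-tuning files. The function name and the task parameter are made up for illustration, and any other template could be substituted.

```python
import json

DEFAULT_TASK = "Simulate an AAC communicator given the following icon input: "

def to_text_record(fields, task=DEFAULT_TASK):
    """Wrap an {'input', 'output'} record in a single-field prompt record."""
    # The input already holds one "* {...}" bullet per icon.
    return {"text": f"<human>: {task}{fields['input']}\n<bot>: {fields['output']}\n"}

def convert(lines, task=DEFAULT_TASK):
    """Convert JSONL lines with fields into JSONL lines with a 'text' key."""
    return [
        json.dumps(to_text_record(json.loads(line), task))
        for line in lines
        if line.strip()
    ]
```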
The pictograms are authored by Sergio Palao / Origin: ARASAAC / License: CC (BY-NC-SA).
The remaining files are authored by Textualization Software Ltd. and released under CC0 (Public Domain).