Data files

This folder contains the data files used in this project:

SimpleNLG sampled files
HTML rendered version of the sampled files
Human-corrected
GPT4 generated
JSONL for fine-tuning
Prompts for inference

SimpleNLG sampled files

Generated using AAC SPEECH STANDALONE; see that repo for details.

HTML rendered versions

Generated using the following Perl one-liner:

cat 2000.out |perl -ne 'if(m/IN/){$i++; print "<li> ($i)";@a=split(/\{/,$_); shift@a;@a=map{s/.*https/https/;s/\s*\}\s*//;$_}@a;foreach(@a){print "<img width=64 height=64 src=\"$_\"/>"}print"<br/>\n"};if(m/OUT/){s/OUT\s*\d+: //;print"<p>";print "($i) "; print;print"</p>";print"</li>"}' > 2000.html
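For readers less comfortable with Perl, the one-liner above can be approximated in Python. This is only a sketch, not the script used to build 2000.html. It assumes a line format inferred from the regular expressions above: each `IN n:` line holds one `{...}` group per icon, ending in the icon URL, and is followed by an `OUT n:` line with the generated sentence. The function name `render_html` is hypothetical.

```python
import re


def render_html(lines):
    """Render IN/OUT line pairs as HTML list items with 64x64 icons."""
    parts = []
    count = 0  # running utterance number, as $i in the Perl version
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("IN"):
            count += 1
            # Assumption: every icon sits in a {...} group that ends with its URL.
            urls = re.findall(r"\{[^}]*?(https[^}\s]*)\s*\}", line)
            imgs = "".join(f'<img width=64 height=64 src="{u}"/>' for u in urls)
            parts.append(f"<li> ({count}){imgs}<br/>")
        elif line.startswith("OUT"):
            # Drop the "OUT n: " prefix and keep the sentence.
            text = re.sub(r"^OUT\s*\d+: ", "", line)
            parts.append(f"<p>({count}) {text}</p></li>")
    return "\n".join(parts)
```

Unlike the Perl version, this sketch only treats a line as input or output when it starts with `IN` or `OUT`, which avoids false matches inside sentences.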

Human-corrected

The LibreOffice files were produced by opening the HTML files in Mozilla Firefox, selecting all, copying, and then pasting into a blank LibreOffice document. This took a lot of RAM and time to process.

The texts themselves were corrected by hand, which included deleting sequences of icons that were nonsensical or included unsavory themes. A total of 504 utterances are available for training.

The annotation1.odt file contains all 2,000 entries; annotation_final.odt contains only the 504 hand-corrected ones.

The .xhtml file was obtained by saving the ODT file as XHTML. It is used to produce the JSONL files used for fine-tuning and the TXT file for GPT4.

GPT4 enhanced

The 504 annotated utterances were sent through GPT4 with the prompt:

Rewrite the following outputs for an AAC Communicator to
make them warmer , more familiar and the type of things a
4 year old would say:

Given GPT4's 8k context limit, this required 5 calls.
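Splitting the utterances across several calls can be sketched as a greedy batching step. This is an illustration only: the README does not say how the 5 batches were formed, and the function `batch_lines` and its character budget `max_chars` are hypothetical stand-ins for a real token count.

```python
def batch_lines(lines, max_chars):
    """Greedily pack lines into batches whose joined length stays within max_chars."""
    batches, current, size = [], [], 0
    for line in lines:
        cost = len(line) + 1  # +1 for the newline that joins lines in a request
        if current and size + cost > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(line)
        size += cost
    if current:
        batches.append(current)
    return batches
```

Each batch would then be appended to the prompt above and sent as one request. A line longer than the budget still gets a batch of its own.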

The input texts (annotation-504.txt) were extracted from the XHTML file with the following Bash one-liner:

cat annotation_final.xhtml |grep 'Text_20_body'|grep '<p'|perl -pe 's/.*Text_20_body\"\>//'|perl -ne 's/\<span[^>]+>//g;s/\<\/span\>//g;s/\.?\s*\<\/p.*//;print if m/\(\d+\)/' > annotation-504.txt
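The same extraction can be sketched in Python, which makes the individual steps easier to follow. It assumes, as the pipeline above does, that each utterance sits on a single line inside a paragraph whose class is `Text_20_body`. The function name `extract_utterances` is hypothetical.

```python
import re


def extract_utterances(xhtml_lines):
    """Return the numbered '(n) text' utterances found in Text_20_body paragraphs."""
    utterances = []
    for line in xhtml_lines:
        # Same filter as the two grep calls.
        if "Text_20_body" not in line or "<p" not in line:
            continue
        text = re.sub(r'.*Text_20_body">', "", line)       # drop the opening tag
        text = re.sub(r"</?span[^>]*>", "", text)           # drop span markup
        text = re.sub(r"\.?\s*</p.*", "", text).strip()     # drop final period and closing tag
        if re.search(r"\(\d+\)", text):                     # keep numbered utterances only
            utterances.append(text)
    return utterances
```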

Fine-tuning instructions

The instructions to fine-tune using the base texts were obtained with the following one-liner:

cat 2000.out |perl -e 'open(A, "annotation-504.txt"); @a=<A>;chomp(@a);%a=map { ($d,$t) = m/\((\d+)\) (.*)/; $d=>$t } @a; while(<STDIN>){if(m/IN/){$i++; if($a{$i}){chomp; s/^IN \d+\: //;s-https://textualization.com/acc_icons/--g; print "$_\t".$a{$i}."\n"}}}'|perl -ne 'use JSON; chomp; ($i,$o)=split(/\t/); $i=~s/\{/\n* \{/g;$j={ 'text'=> "<human>: Simulate an AAC communicator given the following icon input: $i\n<bot>: $o\n"};print encode_json $j;print"\n"' > aac-504.jsonl

and

cat 2000.out |perl -e 'open(A, "annotation-toddler-504.txt"); @a=<A>;chomp(@a);%a=map { ($d,$t) = m/\((\d+)\) (.*)/; $d=>$t } @a; while(<STDIN>){if(m/IN/){$i++; if($a{$i}){chomp; s/^IN \d+\: //;s-https://textualization.com/acc_icons/--g; print "$_\t".$a{$i}."\n"}}}'|perl -ne 'use JSON; chomp; ($i,$o)=split(/\t/); $i=~s/\{/\n* \{/g;$j={ 'text'=> "<human>: Simulate an AAC communicator given the following icon input: $i\n<bot>: $o\n"};print encode_json $j;print"\n"' > OpenChatKit/data/aac-toddler-504.jsonl

for the GPT4-enhanced texts (annotation-toddler-504.txt).

These files should go into OpenChatKit/data.

(Might need to install the package libjson-perl to regenerate it.)
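The join performed by the two one-liners above can be sketched in Python as follows. This is a simplified reading of the Perl, not a replacement for it. It assumes the line formats implied by the regular expressions: `(n) text` lines in the annotation file and `IN n: {...} {...}` lines in 2000.out. The names `load_annotations` and `build_records` are hypothetical.

```python
import json
import re

ICON_PREFIX = "https://textualization.com/acc_icons/"
TASK = "Simulate an AAC communicator given the following icon input: "


def load_annotations(lines):
    """Map utterance number to corrected text, from '(n) text' lines."""
    table = {}
    for line in lines:
        m = re.search(r"\((\d+)\) (.*)", line.strip())
        if m:
            table[int(m.group(1))] = m.group(2)
    return table


def build_records(out_lines, annotations):
    """Return one JSON string per IN line that has a corrected text."""
    records = []
    count = 0  # running count of IN lines; this, not the 'IN n' label, is the key
    for line in out_lines:
        if not line.startswith("IN"):
            continue
        count += 1
        target = annotations.get(count)
        if not target:
            continue
        icons = re.sub(r"^IN \d+: ", "", line.rstrip("\n")).replace(ICON_PREFIX, "")
        icons = icons.replace("{", "\n* {")  # one bullet per icon group
        text = f"<human>: {TASK}{icons}\n<bot>: {target}\n"
        records.append(json.dumps({"text": text}))
    return records
```

Writing the returned strings one per line yields a JSONL file with the same `text` field as aac-504.jsonl.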

Prompts

For running inference, the full set of prompt inputs is useful:

cat 2000.out |perl -ne 'use JSON; if(m/IN/){chomp; s/^IN \d+\: //;s-https://textualization.com/acc_icons/--g; s/\{/\n* \{/g;$j={ 'text'=> "<human>: Simulate an AAC communicator given the following icon input: $_\n<bot>:"};print encode_json $j;print"\n"}' > 2000.prompts

(Might need to install the package libjson-perl to regenerate it.)

Experimenting with different prompts

The file aac-504-fields.jsonl contains the input and output as separate fields, without any system prompt or task prompt.

It was obtained with:

cat 2000.out |perl -e 'open(A, "annotation-504.txt"); @a=<A>;chomp(@a);%a=map { ($d,$t) = m/\((\d+)\) (.*)/; $d=>$t } @a; while(<STDIN>){if(m/IN/){$i++; if($a{$i}){chomp; s/^IN \d+\: //;s-https://textualization.com/acc_icons/--g; print "$_\t".$a{$i}."\n"}}}'|perl -ne 'use JSON; chomp; ($i,$o)=split(/\t/); $i=~s/\{/\n* \{/g;$j={ 'input'=> $i, 'output'=> $o};print encode_json $j;print"\n"' > aac-504-fields.jsonl

(Might need to install the package libjson-perl to regenerate it.)
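Because the fields file keeps input and output separate, trying a new prompt is a matter of applying a template to each record. The sketch below shows one way to do that. The function `apply_template` and the template string in the usage note are hypothetical and are not part of this repo.

```python
import json


def apply_template(jsonl_lines, template):
    """Yield one JSON string with a 'text' field per input/output record."""
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        # The template uses {input} and {output} placeholders.
        text = template.format(input=rec["input"], output=rec["output"])
        yield json.dumps({"text": text})
```

For example, `apply_template(open("aac-504-fields.jsonl"), "### Icons:{input}\n### Says: {output}\n")` would produce records in a different prompt style from the same data.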

LICENSE

The pictograms are authored by Sergio Palao / Origin: ARASAAC / License: CC (BY-NC-SA).

The remaining files are authored by Textualization Software Ltd. and released under CC0 (Public Domain).
