tools for preparing seq2seq data by bobbyjaros · Pull Request #114 · BIDData/BIDMach

bobbyjaros · 2016-04-26T23:03:01Z

Adding nnparse.exe and SeqToSeqData.scala, which together convert paired text files into the formatted matrices consumed by SeqToSeq.

[Just a cleaner version of #77 (which had some extra mods unrelated to this PR)]

newparse can optionally output paragraphids and sentenceids for each token:

    p1 s1 w1
    p2 s2 w2
    p3 s3 w3
    p4 s4 w4
    p5 s5 w5
    p6 s6 w6

nnparse uses this functionality in a very simple form: it assumes each newline denotes a paragraph and each ". " or "? " or "! " denotes a new sentence.
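The splitting rule above can be sketched in a few lines. This is only an illustrative Python sketch, not the nnparse code: the function name `tag_tokens`, whitespace tokenization, 0-based ids, and a sentence counter that keeps running across paragraphs are all assumptions made for the example, and it emits the token text where the real tool would emit a dictionary index.

```python
import re

def tag_tokens(text):
    """Return (paragraph_id, sentence_id, token) triples.

    Assumptions (not taken from nnparse): ids are 0-based, tokens are
    whitespace-separated, and the sentence id keeps counting across paragraphs.
    """
    rows = []
    sentence_id = 0
    # Each newline starts a new paragraph.
    for paragraph_id, paragraph in enumerate(text.split("\n")):
        # Split on a space that follows '.', '?' or '!', i.e. ". ", "? ", "! ".
        for sentence in re.split(r"(?<=[.?!]) ", paragraph):
            tokens = sentence.split()
            if not tokens:
                continue  # skip empty lines and stray spaces
            for token in tokens:
                rows.append((paragraph_id, sentence_id, token))
            sentence_id += 1
    return rows

# tag_tokens("Hi there. Ok?\nBye")
# -> [(0, 0, 'Hi'), (0, 0, 'there.'), (0, 1, 'Ok?'), (1, 2, 'Bye')]
```

When every input line holds exactly one sentence, as assumed for SeqToSeq, the paragraph id alone identifies the sentence.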

SeqToSeqData.scala starts with the output of nnparse.exe, two paired files each with this format:

    p1 s1 w1
    p2 s2 w2
    p3 s3 w3
    p4 s4 w4
    p5 s5 w5
    p6 s6 w6

(For SeqToSeq we assume each line contains one sentence, so the paragraphid (the first column) denotes the sentence and the sentenceid (the second column) is always ignored.)

The two parsed sentence IMats are paired line-by-line: the ith line of the src IMat corresponds to the ith line of the dst IMat.

It produces two paired SMats of the following form:

    w00  w01  w02  w03  w04  w05 ...
    w10  w11  w12  w13  w14  w15P ...
    w20  w21  w22  w23P w24  w25P ...
    w30  w31P w32 ...
    w40P w32P w33 ...

where wij is the dictionary index of the i'th word in the j'th sentence, and words with a P suffix are padding symbols.

The columns of the two output SMats are still paired: column j of the src output SMat and column j of the dst output SMat correspond to line j of the src input and line j of the dst input respectively.

Furthermore, the sentences are collated into batches of similar lengths. The minibatches are randomly permuted after collation to avoid training bias.

See in-file docs for additional options.
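The collation described above (pad, batch by similar length, then permute the batches) can be illustrated with a small Python sketch. It is not the SeqToSeqData.scala implementation: the names `pad_columns` and `make_batches`, the padding index `PAD = 0`, sorting by source length only, and returning one dense matrix per batch instead of a single SMat are all assumptions chosen to keep the example short.

```python
import random

PAD = 0  # hypothetical padding index; the real padding symbol is not specified here

def pad_columns(sentences, pad=PAD):
    """Build m with m[i][j] = i'th word of the j'th sentence, padded to the longest."""
    height = max(len(s) for s in sentences)
    return [[s[i] if i < len(s) else pad for s in sentences] for i in range(height)]

def make_batches(src, dst, batch_size, seed=0):
    """Return a shuffled list of (src_matrix, dst_matrix, original_line_numbers)."""
    if len(src) != len(dst):
        raise ValueError("src and dst must have the same number of sentences")
    # Sort sentence pairs by source length so each batch holds similar lengths.
    order = sorted(range(len(src)), key=lambda j: len(src[j]))
    batches = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        # Column k of both matrices comes from the same input line idx[k].
        batches.append((pad_columns([src[j] for j in idx]),
                        pad_columns([dst[j] for j in idx]),
                        idx))
    # Permute whole minibatches, not individual sentences, after collation.
    random.Random(seed).shuffle(batches)
    return batches
```

Shuffling whole batches keeps similar-length sentences together, which limits padding, while still varying the order in which lengths are seen during training.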

Bobby Jaros added 4 commits December 17, 2015 22:36

Merge remote-tracking branch 'upstream/master' into nnparse

bcf4d0b

Functionality to map indices from src dict to target dict

45de1ca

bobbyjaros wants to merge 4 commits into BIDData:master from bobbyjaros:nnparse
