added 4 commits
December 17, 2015 22:36
newparse can optionally output a paragraphid and a sentenceid for each token, one token per line:
p1 s1 w1
p2 s2 w2
p3 s3 w3
p4 s4 w4
p5 s5 w5
p6 s6 w6
nnparse uses a very simple version of this functionality: it assumes each
newline denotes a new paragraph and each ". ", "? ", or "! " denotes a new
sentence.
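The splitting rule above can be sketched in a few lines. This is a hypothetical Python illustration, not the nnparse source: the function name `parse_ids`, the zero-based ids, the globally increasing sentence counter, and the first-appearance dictionary are all assumptions made for the example.

```python
import re

def parse_ids(text):
    """Return (rows, vocab) where rows are (paragraph_id, sentence_id, word_id).

    Assumptions (not taken from nnparse): ids are zero-based, sentence ids
    increase across the whole input, and word ids are assigned in order of
    first appearance.
    """
    vocab = {}
    rows = []
    sentence_id = 0
    # Each newline starts a new paragraph.
    for paragraph_id, line in enumerate(text.split("\n")):
        # A space preceded by '.', '?' or '!' starts a new sentence.
        for sentence in re.split(r"(?<=[.?!]) ", line):
            words = sentence.split()
            if not words:
                continue  # skip empty lines and stray whitespace
            for word in words:
                word_id = vocab.setdefault(word, len(vocab))
                rows.append((paragraph_id, sentence_id, word_id))
            sentence_id += 1
    return rows, vocab
```

Calling `parse_ids("Hi there. How are you?\nFine!")` yields six rows covering two paragraphs and three sentences. Punctuation stays attached to the preceding word here, which a real tokenizer would likely handle differently.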
Starts with the output of nnparse.exe, two paired files each with this format:
p1 s1 w1
p2 s2 w2
p3 s3 w3
p4 s4 w4
p5 s5 w5
p6 s6 w6
(For SeqToSeq we assume each line contains one sentence, so the paragraphid
in the first column identifies the sentence, and the sentenceid in the
second column is always ignored.)
The two parsed sentence IMats are paired line-by-line: the ith line of the
src IMat corresponds to the ith line of the dst IMat.
Produces two paired SMats of the following form:
w00 w01 w02 w03 w04 w05 ...
w10 w11 w12 w13 w14 w15P ...
w20 w21 w22 w23P w24 w25P ...
w30 w31P w32 ...
w40P w41P w42 ...
where
wij is the dictionary index of the ith word in the jth sentence, and
words with a P suffix are padding symbols.
The columns of the two output SMats are still paired: column j of the
src output SMat and column j of the dst output SMat correspond to line j
of the src input and line j of the dst input respectively.
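As a rough illustration of the layout described above, the following Python sketch turns parsed (paragraphid, sentenceid, word) rows into a padded matrix with one sentence per column. It is not the code in SeqToSeqData.scala: the names `sentences_from_rows` and `to_padded_columns`, the `PAD` value, and the use of plain nested lists instead of an SMat are assumptions for the example. Padding is placed after the last real word, as in the diagram.

```python
from itertools import groupby

PAD = -1  # hypothetical padding symbol; the real dictionary index may differ

def sentences_from_rows(rows):
    """Group consecutive rows by the first column (the paragraphid).

    The second column (the sentenceid) is ignored, as described for SeqToSeq.
    """
    return [[word for _, _, word in group]
            for _, group in groupby(rows, key=lambda row: row[0])]

def to_padded_columns(sentences, pad=PAD):
    """Return a matrix m where m[i][j] is word i of sentence j, or pad."""
    height = max(map(len, sentences), default=0)
    return [[s[i] if i < len(s) else pad for s in sentences]
            for i in range(height)]
```

Applying the same two functions to the src rows and the dst rows keeps column j of both outputs tied to line j of both inputs, provided neither input skips a line.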
Furthermore, the sentences are collated into batches of similar lengths.
The minibatches are randomly permuted after collation to avoid training bias.
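One way to picture the collation step is the Python sketch below. The function name `collate_batches`, the `batch_size` and `seed` parameters, and the choice to sort by (src length, dst length) are assumptions for illustration; the PR text does not say how SeqToSeqData.scala measures "similar length".

```python
import random

def collate_batches(src, dst, batch_size, seed=0):
    """Group paired sentences into minibatches of similar length.

    src and dst are lists of sentences (lists of word ids) paired by index.
    Returns a list of (src_batch, dst_batch) tuples in random order.
    """
    if len(src) != len(dst):
        raise ValueError("src and dst must contain the same number of sentences")
    # Sort sentence indices so that neighbours have similar lengths.
    order = sorted(range(len(src)), key=lambda j: (len(src[j]), len(dst[j])))
    # Cut the sorted order into consecutive minibatches.
    batches = [order[k:k + batch_size] for k in range(0, len(order), batch_size)]
    # Permute whole minibatches, not individual sentences, so each batch
    # keeps its similar-length property.
    random.Random(seed).shuffle(batches)
    return [([src[j] for j in batch], [dst[j] for j in batch])
            for batch in batches]
```

Because the same index list selects from both `src` and `dst`, the pairing between the two sides survives both the sort and the shuffle.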
See in-file docs for additional options.
Adds nnparse.exe and SeqToSeqData.scala, which together convert paired text files into the formatted matrices consumed by SeqToSeq.
[Just a cleaner version of #77 (which had some extra mods unrelated to this PR)]