Won't read data from UTF-8 model created by C version of word2vec #44

@gerryhocks

Hello,

The code as it stands cannot correctly read a UTF-8 vocabulary from a word2vec binary model created using the C version of word2vec.

This is because the vocabulary's bytes are appended to a string buffer one at a time, as if each byte were a character, so any character encoded as a multi-byte UTF-8 sequence is corrupted.
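To make the failure mode concrete, here is a minimal sketch. It does not reproduce the library's actual code: the class `ByteVsCharDemo` and both method names are hypothetical, and the cast-per-byte loop is only an assumption about how the bytes end up in the string buffer.

```java
import java.nio.charset.StandardCharsets;

class ByteVsCharDemo {
    // Assumed failure mode: each byte is cast to a char and appended on its own.
    static String appendBytesAsChars(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append((char) b); // one char per byte, so multi-byte sequences are split
        }
        return sb.toString();
    }

    // Decoding the whole byte sequence at once keeps multi-byte characters intact.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "naive" with a diaeresis on the i; that letter takes two bytes in UTF-8.
        byte[] word = "na\u00EFve".getBytes(StandardCharsets.UTF_8);
        System.out.println(appendBytesAsChars(word).length()); // one char per byte
        System.out.println(decodeUtf8(word).length());         // one char per character
    }
}
```

A word containing only ASCII characters comes out the same from both methods, which is why the problem only shows up with non-ASCII vocabularies.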

A workaround/hack like the following in the fromBinFile() method of Word2VecModel.java gets around the issue: it collects each word's raw bytes and decodes them as UTF-8 in one step. It probably still works for single-byte characters:

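            // Requires: import java.nio.charset.StandardCharsets;
            // Scratch buffer for one word's raw bytes; assumes no word exceeds 1024 bytes.
            byte[] buff = new byte[1024];
            for (int lineno = 0; lineno < vocabSize; lineno++) {
                // read vocab: collect raw bytes up to the separating space
                int bpos = 0;
                byte b = buffer.get();
                while (b != ' ') {
                    // skip any newline byte instead of storing it as part of the word
                    if (b != '\n') {
                        buff[bpos++] = b;
                    }
                    b = buffer.get();
                }
                // decode the whole word at once so multi-byte sequences stay intact
                vocabs.add(new String(buff, 0, bpos, StandardCharsets.UTF_8));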
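The same idea can be sketched as a standalone helper that does not rely on a fixed 1024-byte scratch array. This is an illustration rather than a patch for the project: the class `Utf8VocabReader` and the method `readWord` are hypothetical names, and the sample input in `main` is made up. Scanning byte-by-byte for the space separator is safe because every byte of a multi-byte UTF-8 sequence has its high bit set, so it can never equal ' ' or '\n'.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class Utf8VocabReader {
    // Reads one word from the buffer: skips newline bytes, stops at the first
    // space, and decodes the collected bytes as UTF-8 in a single step.
    static String readWord(ByteBuffer buffer) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream(); // grows as needed
        byte b = buffer.get();
        while (b != ' ') {
            if (b != '\n') {
                bytes.write(b);
            }
            b = buffer.get();
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Made-up sample: a leading newline, a word with a two-byte character,
        // the separating space, then whatever follows the word.
        byte[] data = "\nna\u00EFve rest".getBytes(StandardCharsets.UTF_8);
        ByteBuffer buffer = ByteBuffer.wrap(data);
        System.out.println(readWord(buffer));
        System.out.println(buffer.remaining()); // bytes left after the word and its space
    }
}
```

After `readWord` returns, the buffer is positioned just past the space, so the caller can go on to read the word's vector from the same buffer.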
