
AliceSum commented on Jan 30, 2023

Please don't hesitate to close this one.

I reproduced the all-attention model.

It looks like I can't use your Linear.weight directly in a dot product with another vector, because I got an error that the weight is on the CPU while the other tensors are on the GPU. So I had to change it to call linear(q) instead.

This pull request therefore changes those weight.T dot products to linear(q) calls, and similar.
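
For reference, this is roughly the kind of change I mean (a minimal sketch, not the exact code from the notebook; the `proj` layer and shapes are just placeholders for illustration):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical projection layer and query batch, only for illustration.
proj = nn.Linear(64, 64, bias=False).to(device)
q = torch.randn(8, 64, device=device)

# Before: dotting with the raw weight raises a device-mismatch error
# if the weight tensor is still on the CPU while q lives on the GPU.
# out = q @ proj.weight.T

# After: calling the module uses the weight wherever the module was moved to,
# and with bias=False it computes exactly q @ proj.weight.T.
out = proj(q)
```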

The result is not good. I guess they are right that you can't simply remove the softmax (see the sketch after the list below).

  1. This is a sign that making a Transformer efficient doesn't mean turning the whole thing into an MLP.

  2. Or I am wrong here because one or more of the residual connections are missing in the All-Attention version?
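
To be explicit about what "removing the softmax" means here, this is a minimal sketch of scaled dot-product attention with and without the softmax (not the exact notebook code; masking and dropout are left out):

```python
import math
import torch
import torch.nn.functional as F

def attention(q, k, v, use_softmax=True):
    # q, k, v: (batch, heads, seq, head_dim)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if use_softmax:
        scores = F.softmax(scores, dim=-1)  # each row is a proper weighting that sums to 1
    # Without the softmax the rows are raw scaled dot products, so the whole
    # block collapses into a linear mixing of the values.
    return scores @ v
```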

If you are interested, feel free to check out the notebook or the website to see how I trained the model on the Shakespeare dataset with my own version of nanoGPT.
The number of parameters goes from 19 to 12 with All-Attention, but the result is not great.
https://github.com/JonathanSum/NLP-Notebooks-Andrej-Course/blob/main/gpt2_part2.ipynb
Website result:
https://jonathansum.github.io/Blog/#/gpt2_part2
