Add another way to run random fun Transformer #5
Please don't hesitate to close this one.
I reproduced the All-Attention model.
It looks like I can't really dot your Linear.weight with another vector, because PyTorch told me the weight is on the CPU while the other tensors are on the GPU. So I had to change it to call linear(q) instead.
That is what this pull request does: it replaces the weight.T multiplications with calls like linear(q), etc.
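Roughly, the change looks like this (a minimal sketch with made-up shapes, not the actual code from the notebook):

```python
import torch
import torch.nn as nn

# Made-up sizes, only for illustration.
d_model = 64
device = "cuda" if torch.cuda.is_available() else "cpu"

linear = nn.Linear(d_model, d_model, bias=False)  # parameters start on the CPU
q = torch.randn(8, d_model, device=device)        # activations may live on the GPU

# Old approach: grab the weight tensor and multiply it manually.
# If `q` is on the GPU while `linear.weight` is still on the CPU,
# this line raises an error about tensors being on different devices.
# out = q @ linear.weight.T

# This PR's approach: move the module and call it. With no bias,
# nn.Linear computes the same q @ weight.T product, so the result matches.
linear = linear.to(device)
out = linear(q)
```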
The result is not good. I guess they are right that you can't just remove the softmax.
This is a sign that making a Transformer efficient doesn't mean turning it all into MLPs.
Or am I wrong here because one or more of the residual connections are missing in the All-Attention version?
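For context, this is roughly what removing the softmax from a single attention head means (again just a sketch with illustrative shapes; the actual layer in the notebook may differ):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes only.
B, T, d = 2, 16, 32
q = torch.randn(B, T, d)
k = torch.randn(B, T, d)
v = torch.randn(B, T, d)

scores = q @ k.transpose(-2, -1) / d ** 0.5  # (B, T, T) raw attention scores

# Standard attention: softmax normalizes each row of scores into weights
# that sum to 1 before mixing the values.
out_softmax = F.softmax(scores, dim=-1) @ v

# "No softmax" variant: use the raw scores to mix the values directly,
# so there is no nonlinearity between the Q.K^T scores and V.
out_no_softmax = scores @ v
```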
If you are interested, please feel free to check out the notebook or the website to see how I trained the model on the Shakespeare dataset with my own version of nanoGPT.

The number of parameters goes from 19 to 12 with All-Attention, but the result is not great.
https://github.com/JonathanSum/NLP-Notebooks-Andrej-Course/blob/main/gpt2_part2.ipynb
Website result:
https://jonathansum.github.io/Blog/#/gpt2_part2