Add another way to run random fun Transformer #5
Please don't hesitate to close this one.
I reproduced the All-Attention model.
It looks like I can't really dot your Linear.weight with another vector, because PyTorch told me the weight is on the CPU while the other tensors are on the GPU. So I had to change it to call linear(q) instead.
That is what this pull request does: it replaces the weight.T multiplications with calls like linear(q), etc.
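Roughly, the change looks like this (a minimal sketch with made-up shapes, not the actual code from the notebook):

```python
import torch
import torch.nn as nn

# Made-up sizes, only for illustration.
d_model = 64
device = "cuda" if torch.cuda.is_available() else "cpu"

linear = nn.Linear(d_model, d_model, bias=False)  # parameters start on the CPU
q = torch.randn(8, d_model, device=device)        # activations may live on the GPU

# Old approach: grab the weight tensor and multiply it manually.
# If `q` is on the GPU while `linear.weight` is still on the CPU,
# this line raises an error about tensors being on different devices.
# out = q @ linear.weight.T

# This PR's approach: move the module and call it. With no bias,
# nn.Linear computes the same q @ weight.T product, so the result matches.
linear = linear.to(device)
out = linear(q)
```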
The result is not good. I guess they are right that you can't just remove the softmax.
This is a sign that making a Transformer efficient doesn't mean turning it all into MLPs.
Or am I wrong here because one or more of the residual connections are missing in the All-Attention version?
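For context, this is roughly what removing the softmax from a single attention head means (again just a sketch with illustrative shapes; the actual layer in the notebook may differ):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes only.
B, T, d = 2, 16, 32
q = torch.randn(B, T, d)
k = torch.randn(B, T, d)
v = torch.randn(B, T, d)

scores = q @ k.transpose(-2, -1) / d ** 0.5  # (B, T, T) raw attention scores

# Standard attention: softmax normalizes each row of scores into weights
# that sum to 1 before mixing the values.
out_softmax = F.softmax(scores, dim=-1) @ v

# "No softmax" variant: use the raw scores to mix the values directly,
# so there is no nonlinearity between the Q.K^T scores and V.
out_no_softmax = scores @ v
```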
If you are interested, please feel free to check out the notebook or the website to see how I trained the model on the Shakespeare dataset with my own version of nanoGPT.

The number of parameters goes from 19 to 12 with All-Attention, but the result is not great.
https://github.com/JonathanSum/NLP-Notebooks-Andrej-Course/blob/main/gpt2_part2.ipynb
Website result:
https://jonathansum.github.io/Blog/#/gpt2_part2