Read or listen to the newsletter with all the documents I’ve chosen to host here. Support my learning journey by clicking the Join button above, becoming a Patreon member, or sending a one-time Venmo! Discuss this with other Tunadorks on Discord. All my other links.
Timestamps:
00:00 Introduction
00:37 Unexpected benefits of self-modeling in neural systems
01:57 Gemma 2 – Improving open LMs at a practical scale
02:27 Next generation reservoir computing
03:42 Anthropic Circuits Updates – July 2024
04:30 Transformers are universal learners in context
05:47 Revisiting token embedding with…
Great video, would be better without the awful haircut though
Got like 90 accounts with AI music tools that I will never use because their initial free credits churned out horrible samples. The real money is in making a product that works, not slapping "AI" on the website and calling it a day. Sucks that SEO serves up 5 pages of garbage.
A couple years ago I used LeCun's JEPA-based self-modeling to improve next-day weather prediction by 50% (MSE of temperature, precipitation, and winds).
Our method was to predict feature-based representations of the NEXT day using the current day's features.
Basically, you use a ResNet VAE to parse weather data into features, then use those to predict the features of the next day.
The feature prediction loss was then added to total training loss (along with standard full-res reconstruction loss).
What I suspect this did was to reward the model for focusing on "predictable and durable features" across multiple days.
An important observation was that the JEPA predictor sub-net, which used current features to predict next-day features, needed to be "simple" (1-2 hidden conv layers). This likely forced representations to be simple yet useful – say, simple convolutional translations of moving weather patterns.
For example, if you have a "front feature" at a particular location, chances are it will move/geographically translate to the next day, yet remain a "front feature" in JEPA feature space.
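For reference, here is a minimal sketch of the training objective described in this comment: a conv encoder/decoder plus a deliberately small JEPA predictor that maps today's features to a prediction of tomorrow's features, with the feature-prediction loss added to the reconstruction loss. All layer sizes and the stop-gradient on the target features are my assumptions rather than details from the original project, and the variational/KL part of the VAE is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeatherJEPA(nn.Module):
    """Sketch: encoder/decoder plus a small JEPA predictor that maps
    today's feature map to a prediction of tomorrow's feature map."""
    def __init__(self, in_ch=3, feat_ch=32):
        super().__init__()
        # encoder: weather grid (e.g. temp/precip/wind channels) -> feature map
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_ch, 3, stride=2, padding=1),
        )
        # decoder: feature map -> full-resolution reconstruction
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(feat_ch, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, in_ch, 4, stride=2, padding=1),
        )
        # deliberately simple predictor (1-2 conv layers), as the comment notes
        self.predictor = nn.Sequential(
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1),
        )

    def loss(self, day_t, day_t1, pred_weight=1.0):
        z_t = self.encoder(day_t)        # today's features
        z_t1 = self.encoder(day_t1)      # tomorrow's features (the target)
        recon = self.decoder(z_t)
        recon_loss = F.mse_loss(recon, day_t)          # standard full-res reconstruction
        # predict tomorrow's features from today's; detaching the target is my
        # assumption (a common way to discourage collapse), not a stated detail
        pred_loss = F.mse_loss(self.predictor(z_t), z_t1.detach())
        return recon_loss + pred_weight * pred_loss
```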
The project was a low-budget, student-level (me lol) sub-project that didn't seek funding, since the top-level project was already funded. I wasn't aware of the funding details, but the project got consistent interest from ATOC scientists during check-in meetings.
My last week, on our last Google Meets call with a project scientist from an out-of-state uni, he was surprised it was my last week (he wasn't even aware of it before my boss mentioned it to him as he started getting too excited about the project's potential). After that news, he assured me I wouldn't have a problem finding another position. Haha, a year later … lol
Career lesson: It's tough getting in from the outside. Make sure you have fallback positions lined up while you're still employed. It's just like women who value you more for your CURRENT high-value connections than your skills and past success lol
Another observation was how the established scientists were already freaking out 2 years before the end of their contract. Guys, you gotta plan years ahead in this field.
Damn, just downloaded like half that list. Love the curation you do.
The Apple Intelligence paper isn’t too interesting, but have a look at section 5.1: something about adapting to the task at hand on the fly using LoRA. I don’t know of other literature related to this, but it sounds pretty interesting to me.
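For readers unfamiliar with the idea being pointed at: the general pattern is a frozen base model plus small low-rank adapters that can be swapped per task at runtime. The plain-PyTorch sketch below illustrates that pattern only; it is not the Apple paper's implementation, and all names, ranks, and task labels here are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base Linear plus swappable low-rank adapters, one per task."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # base weights stay frozen
        self.rank, self.scale = rank, alpha / rank
        self.adapters = nn.ModuleDict()      # task name -> (A, B) pair
        self.active = None

    def add_adapter(self, task: str):
        in_f, out_f = self.base.in_features, self.base.out_features
        self.adapters[task] = nn.ModuleDict({
            "A": nn.Linear(in_f, self.rank, bias=False),
            "B": nn.Linear(self.rank, out_f, bias=False),
        })
        nn.init.zeros_(self.adapters[task]["B"].weight)  # adapter starts as a no-op

    def set_adapter(self, task: str):
        self.active = task                   # "adapt on the fly": just switch tasks

    def forward(self, x):
        y = self.base(x)
        if self.active is not None:
            ad = self.adapters[self.active]
            y = y + self.scale * ad["B"](ad["A"](x))
        return y

# usage sketch: one adapter per (hypothetical) task, swapped at inference time
layer = LoRALinear(nn.Linear(512, 512))
layer.add_adapter("summarization")
layer.add_adapter("mail_reply")
layer.set_adapter("summarization")
out = layer(torch.randn(1, 512))
```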
Child-language use: blind children?
Surprised you've never heard of -1 bit quantization when it was mentioned several times last year in /g/aicg.
Yo, I got addicted to your channel. I kinda binge-watched your latest vids. I just grab what I can and then guess at the concepts I don't fully understand.
Seems like we could use synthetic data for the blind vision-model problem. They could use Unreal or Unity, armed with a huge pile of models and shaders made by game-dev artists, to set up millions of permutations of complex scenes from different angles, along with labels we could piece together as we assemble each scene, and then train on that (rough sketch below).
I have to assume Musk & Co are doing that sort of thing for their robot training.
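A hypothetical sketch of the permutation-and-label loop being suggested. The `render_scene` stub stands in for whatever Unreal/Unity rendering hook you would actually use (an engine script, or tools like Unity Perception or UnrealCV), so every name and parameter here is illustrative rather than a real engine API.

```python
import itertools
import json
import random
from pathlib import Path

# hypothetical stand-in for an Unreal/Unity render call
def render_scene(config: dict) -> bytes:
    raise NotImplementedError("replace with an actual engine rendering hook")

# illustrative permutation axes; a real pipeline would have many more
OBJECTS   = ["chair", "mug", "robot_arm", "pallet"]
MATERIALS = ["plastic", "metal", "wood"]
LIGHTING  = ["noon", "dusk", "indoor"]
CAMERAS   = [(0, 30), (45, 30), (90, 60)]   # (azimuth, elevation) in degrees

out_dir = Path("synthetic_dataset")
out_dir.mkdir(exist_ok=True)

for i, (obj, mat, light, cam) in enumerate(
        itertools.product(OBJECTS, MATERIALS, LIGHTING, CAMERAS)):
    config = {
        "object": obj, "material": mat, "lighting": light,
        "camera_azimuth": cam[0], "camera_elevation": cam[1],
        "jitter_seed": random.randint(0, 10**6),
    }
    image_bytes = render_scene(config)                  # render this permutation
    (out_dir / f"{i:06d}.png").write_bytes(image_bytes)
    # the label is assembled from the scene config itself, as the comment suggests
    (out_dir / f"{i:06d}.json").write_text(json.dumps(config))
```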
Really like the first paper shown. It's interesting how introducing self-modeling has the consequence of also simplifying the network. I mean, it makes sense that the model would want to be simpler in order to optimally compute itself. I do wonder what effect the self-modeling has besides that, though: is the primary effect the simplification of the network during training, or does the auxiliary task of predicting internal states assist with the primary task in a meaningful way? Judging from the paper, it seems accuracy on the task actually drops slightly (although MNIST is such a simple classification example that I'm not sure that says anything about performance anyway). Really interested to hear more about this strategy in larger models.
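For context, a rough sketch of the auxiliary self-modeling setup being discussed: a classifier that also predicts its own hidden activations, with that prediction error added to the task loss. The architecture, the choice of which layer to predict, the detach on the target, and the loss weight are all my assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfModelingMLP(nn.Module):
    """MNIST-style classifier with an auxiliary head that predicts the
    network's own first-hidden-layer activations (the 'self-model')."""
    def __init__(self, in_dim=784, h1=256, h2=128, n_classes=10):
        super().__init__()
        self.layer1 = nn.Sequential(nn.Linear(in_dim, h1), nn.ReLU())
        self.layer2 = nn.Sequential(nn.Linear(h1, h2), nn.ReLU())
        self.classifier = nn.Linear(h2, n_classes)
        self.self_model = nn.Linear(h2, h1)   # aux head: predict layer-1 activations

    def forward(self, x):
        a1 = self.layer1(x)
        a2 = self.layer2(a1)
        return self.classifier(a2), self.self_model(a2), a1

def loss_fn(logits, a1_pred, a1, labels, aux_weight=1.0):
    task_loss = F.cross_entropy(logits, labels)
    # auxiliary loss: predict your own activations; detaching the target is my
    # assumption, so the pressure is on making the activations easier to predict
    self_loss = F.mse_loss(a1_pred, a1.detach())
    return task_loss + aux_weight * self_loss
```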
11:10: oh, cool, this sounds similar to something I was daydreaming about (except I was imagining clusters of a handful of tokens, not necessarily matching sentence boundaries, and I was imagining doing this recursively).
Like, I imagine this is like: have an autoencoder that goes from a not-too-long sequence of tokens to a single higher-level token, and then the decoder part predicts the individual tokens given the previous higher-level tokens, the current higher-level token, and the base-level tokens already produced corresponding to the current higher-level token?
I suppose their tokens encoding entire sentences can’t be using a fixed discrete set of tokens for the higher level tokens, so, I guess they just have those be continuous?
(Aside: hm, if you used a standard decoder-only LLM, but instead of selecting a token with the probabilities it assigns, just took the average of the embedding vectors for each of those tokens, and let that iterate a dozen times, and then switched to picking specific tokens again, I wonder what kind of garbage output that would produce?
That thought probably seems pretty unrelated. It came to mind because I was thinking about how, when the “tokens” produced as outputs, are continuous, you don’t get a probability distribution, so the only way to mix between options is to mix the actual options, rather than a probability mix of options.)
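The parenthetical aside above is easy to try. Here is a sketch, assuming a GPT-2 checkpoint from Hugging Face: for a dozen steps, instead of sampling a token, append the probability-weighted average of the token embeddings via `inputs_embeds`, then switch back to picking discrete tokens. The prompt and step counts are arbitrary.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
emb = model.get_input_embeddings().weight            # (vocab, d_model)

prompt = "The strangest thing about continuous tokens is"
ids = tok(prompt, return_tensors="pt").input_ids
x = emb[ids]                                          # start from real token embeddings

with torch.no_grad():
    # a dozen "soft" steps: append the probability-weighted average embedding
    for _ in range(12):
        logits = model(inputs_embeds=x).logits[:, -1]     # next-token distribution
        probs = torch.softmax(logits, dim=-1)
        soft_tok = probs @ emb                            # mix of embeddings, not a token
        x = torch.cat([x, soft_tok.unsqueeze(1)], dim=1)

    # then switch back to picking specific tokens, to see what comes out
    gen_ids = []
    for _ in range(20):
        logits = model(inputs_embeds=x).logits[:, -1]
        next_id = torch.argmax(logits, dim=-1)
        gen_ids.append(next_id.item())
        x = torch.cat([x, emb[next_id].unsqueeze(1)], dim=1)

print(tok.decode(gen_ids))
```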
Another idea I had in relation to this was that maybe the encoding for a cluster of tokens could have two parts: one which is only used when decoding to try to get the particular tokens back, and one which is used for that but also used when predicting the next higher-level token. The idea being that this might encourage it to separate the parts that matter significantly later in the text from irrelevant accidents of phrasing. Perhaps somewhat of a semantics vs phrasing distinction… but probably not quite, because the phrasing at one point probably helps predict the phrasing at a later point, due to stuff like different writing styles, etc., so probably not a clean split.
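One way to read the scheme sketched in this comment thread, as a toy: fixed-size chunks of base tokens are each encoded into one continuous higher-level vector, part of that vector feeds a higher-level next-chunk predictor, and the full vector conditions an autoregressive decoder that reconstructs the chunk's tokens. This is a heavily simplified sketch under my own assumptions (the decoder here is not conditioned on the previous chunks' vectors, targets are detached to avoid trivial collapse, and all sizes are arbitrary), not the paper's method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChunkAutoencoder(nn.Module):
    """Toy hierarchical autoencoder: chunk of base tokens -> one continuous
    higher-level vector -> autoregressive reconstruction of the chunk,
    with part of the chunk vector reserved for next-chunk prediction."""
    def __init__(self, vocab=1000, d=128, chunk_len=8):
        super().__init__()
        self.chunk_len = chunk_len
        self.tok_emb = nn.Embedding(vocab, d)
        self.chunk_enc = nn.GRU(d, d, batch_first=True)            # chunk -> continuous vector
        self.high_pred = nn.GRU(d // 2, d // 2, batch_first=True)  # runs over chunk vectors
        self.dec = nn.GRU(d, d, batch_first=True)                  # reconstructs base tokens
        self.out = nn.Linear(d, vocab)

    def forward(self, tokens):                                  # (B, n_chunks * chunk_len)
        B = tokens.size(0)
        chunks = tokens.view(B, -1, self.chunk_len)             # (B, n_chunks, chunk_len)
        n_chunks = chunks.size(1)

        # encode each chunk into one continuous higher-level vector
        flat = self.tok_emb(chunks.view(-1, self.chunk_len))    # (B*n_chunks, L, d)
        _, h = self.chunk_enc(flat)
        chunk_vecs = h[-1].view(B, n_chunks, -1)                # (B, n_chunks, d)

        # split the code: first half is "prediction-relevant", second half decode-only
        pred_part, _dec_only = chunk_vecs.chunk(2, dim=-1)

        # higher-level model predicts the next chunk's prediction-relevant part
        pred_next, _ = self.high_pred(pred_part[:, :-1])
        high_loss = F.mse_loss(pred_next, pred_part[:, 1:].detach())

        # decode each chunk's tokens conditioned on its full chunk vector
        dec_in = self.tok_emb(chunks[:, :, :-1].reshape(-1, self.chunk_len - 1))
        h0 = chunk_vecs.reshape(1, B * n_chunks, -1)            # init decoder hidden state
        dec_out, _ = self.dec(dec_in, h0)
        logits = self.out(dec_out)                              # predict tokens 1..L-1
        recon_loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            chunks[:, :, 1:].reshape(-1))
        return recon_loss + high_loss
```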
Very nice review. That first paper got my attention!
Do you ever take this information and rework it into multi-dimensional frameworks when you come across new information? I watch your videos, find the original source, and interpret it into my own AI frameworks in many different formats and from many sources of data. Was just wondering if anyone else does that? 😊
first paper out of the gate sounds like a winner 🤯
Skimming abstracts, I love it! Have some engagement.