Tokenization

What's a token and how does ChatGPT work?

Happy Monday to all those who celebrate with us in our ai for normal people fam. We’ve got a great newsletter today but before we jump in, I want to make a correction from yesterday.

🟥 Correction from yesterday 🟥 

Yesterday I said the last company I covered, biblicai, had a .ai at the end of its domain, but that is not true. It’s actually .com, so the domain should be biblicai.com. So if you couldn’t find the website, there it is for you to test out.

My biggest apologies as I pride myself on giving you the best resources across all of the ai space. If I mess something up or don’t give you guys correct information, I will correct it and inform you guys as soon as I possibly can.

Like I said, my biggest apologies.

🏕️ The Setup 🏕️ 

Recently on X, ChatGPT users have been reporting that responses have allegedly not only gotten slower, but also less in-depth and even less accurate.

A lot of the tweets I researched kept coming back to the “tokenization” ability that was allegedly “limited” by the ChatGPT team.

We all know about the computing power needed for the ai space. NVIDIA is leading the way with its H100 computing chips while people are patiently awaiting the H200. Sam Altman is reportedly out looking to raise a quick $7 trillion for more chips. THE PEOPLE NEED MORE CHIPS.

If this story about ChatGPT is true, this is a big deal. With the race that’s on between Gemini, ChatGPT, Grok from Elon, and whatever other companies come out with a similar product, it will only hurt trust in ChatGPT in the long term.

I’ve been wanting to cover this for a while, and today feels like the perfect day to do it. For today’s newsletter, we’re going a little bit underneath the hood: what’s a token, what’s the process of tokenization, and how does ChatGPT give me answers to my problems?

🔗 What’s a Token? 🔗 

You can’t have tokenization without the actual token.

There are multiple types of tokens in software. You could have authentication tokens, access tokens, refresh tokens, JSON Web Tokens, compiler tokens, and others. For the sake of this article today, we are specifically talking about the text tokens that a language model like ChatGPT works with.

TLDR: For ChatGPT, a token is a word, a part of a word, a punctuation mark, or even a space, paired with a numerical ID.

That’s what a singular token is.
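If you like seeing things in code, here’s a minimal sketch in Python of what “a piece of text paired with a numerical ID” looks like. Heads up: the vocabulary and the ID numbers are made up by me for illustration. ChatGPT’s real vocabulary is far larger and uses different IDs.

```python
# A toy vocabulary: each token (a piece of text) is paired with a numerical ID.
# These pieces and ID numbers are invented for illustration only.
TOY_VOCAB = {
    "walk": 0,  # a whole word
    "ing": 1,   # a part of a word
    "!": 2,     # a punctuation mark
    " ": 3,     # even a space
}


def token_id(piece: str) -> int:
    """Look up the numerical ID paired with one piece of text."""
    return TOY_VOCAB[piece]


for piece in ["walk", "ing", "!", " "]:
    print(repr(piece), "->", token_id(piece))
```

Each line it prints is one token: the piece of text on the left, its ID on the right.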

What’s Tokenization? 🤺 

Say you give ChatGPT an input for a request, e.g. “what are good things to do around New York City?”

ChatGPT is built to do 3 things (at the simplest explanation):

  1. Translate that input from text to numerical IDs.

  2. Understand the order of the list of numerical IDs.

  3. Give a response back based on 1 and 2, using what the model learned from the data it was trained on.

TLDR: Tokenization is the process of breaking the user’s input into individual tokens (numerical IDs) so the large language model can work with it and give a sophisticated output that lines up with what the user wanted.
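Here’s a rough sketch in Python of step 1 and why step 2 matters. It’s a simplification: the vocabulary is invented, and the “take the longest piece you recognize” rule is my stand-in for the real thing (ChatGPT’s tokenizer uses a method called byte pair encoding and a much bigger vocabulary).

```python
# Toy vocabulary, invented for illustration (not ChatGPT's real one).
TOY_VOCAB = {"what": 0, " ": 1, "are": 2, "good": 3, "thing": 4, "s": 5, "?": 6}


def tokenize(text: str) -> list[int]:
    """Turn text into a list of IDs by repeatedly taking the longest known piece."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest possible piece first, then shorter and shorter ones.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                pos = end
                break
        else:
            # No piece in the toy vocabulary matches here.
            raise ValueError(f"no token for text starting at {text[pos:]!r}")
    return ids


print(tokenize("what are good things?"))  # [0, 1, 2, 1, 3, 1, 4, 5, 6]
print(tokenize("good things are what?"))  # [3, 1, 4, 5, 1, 2, 1, 0, 6]
```

Notice that “things” gets split into two tokens (“thing” + “s”), and that the same words in a different order produce a different list of numbers. That order is all the model has to go on.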

How does ChatGPT give me an answer? 💻️ 

Here’s how it works:

When the user sends an input, it goes to a tokenizer first, not straight to the model. The tokenizer’s job is to break down the text people type in as an “input” or “prompt” and turn it into tokens.

Once it has all of the tokens, it puts each token’s numerical ID (for the word, part of a word, punctuation mark, or space) into a list of numerical IDs.

When the model receives that list of numbers, it first looks up each numerical ID in the vocabulary it learned during training.

Then it will look at the order in which the numerical IDs came in to make sure it has the right understanding.

The model will then send back another list of numbers, built up one token at a time: the IDs of the tokens that make up its response.

Then it’s the tokenizer’s job to take that list of numbers and turn it back into words in the form of a sentence so we can digest and understand it.
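The whole round trip above (text to numbers, numbers into the model, numbers back to text) can be sketched in a few lines of Python. Everything here is a toy: the vocabulary is invented, and `fake_model` is a hypothetical stand-in that returns a canned reply instead of predicting one like a real model would.

```python
# Toy vocabulary, invented for illustration (not ChatGPT's real one).
TOY_VOCAB = {"best": 0, " ": 1, "dog": 2, "food": 3, "?": 4,
             "try": 5, "kibble": 6, ".": 7}
ID_TO_TEXT = {token_id: text for text, token_id in TOY_VOCAB.items()}


def encode(text: str) -> list[int]:
    """Text -> list of numerical IDs (greedy longest match, a simplification)."""
    ids = []
    pos = 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            if text[pos:end] in TOY_VOCAB:
                ids.append(TOY_VOCAB[text[pos:end]])
                pos = end
                break
        else:
            raise ValueError(f"no token for text starting at {text[pos:]!r}")
    return ids


def decode(ids: list[int]) -> str:
    """List of numerical IDs -> text we can read."""
    return "".join(ID_TO_TEXT[i] for i in ids)


def fake_model(prompt_ids: list[int]) -> list[int]:
    """Hypothetical stand-in for the model: numbers in, numbers out.
    A real model predicts its reply; this one returns a canned answer."""
    return [5, 1, 6, 7]


prompt_ids = encode("best dog food?")
reply_ids = fake_model(prompt_ids)
print(prompt_ids)         # [0, 1, 2, 1, 3, 4]
print(reply_ids)          # [5, 1, 6, 7]
print(decode(reply_ids))  # try kibble.
```

The part in the middle never touches a single letter. It only sees the list of numbers going in and hands back a list of numbers coming out.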

The model’s not reading your input saying “wow, I’m so glad he had a great walk today with his dog and he needs to know what is the best food for his furry friend - let me get that right up for him.”

No.

The model only ever sees your input as numerical IDs (that’s what a token is: the pairing of a piece of input with a number so the model can work with it) so that it can give you your desired outcome as quickly as possible.

Why is this important to me? 💰️

There are a few reasons why a company might want to dial back the speed, accuracy, or depth of the tokens its model sends back:

  1. If you start seeing your speed slow down, it could be because they want to either a) get you to upgrade to a more expensive plan or b) push you onto a newer version of the large language model so they can gather more accurate and current data

  2. If you start seeing your accuracy not as on point, it could be because they want to either a) get you to upgrade to a more expensive plan or b) push you onto a newer version of the large language model so they can gather more accurate and current data

  3. If you start seeing the depth of your responses decrease, it could be because they want to either a) get you to upgrade to a more expensive plan or b) push you onto a newer version of the large language model so they can gather more accurate and current data

It’s either for your data, or your money.

These people aren’t running charities.

My job here is to equip you with the tools to know not only what’s going on, but how to make a difference. The introduction to tokenization is an important first step.

If you liked this, email me and let me know you did. If you hated it, you can also email me. I’ll email you back ;)

See ya tomorrow,

Zander

💥 Subscriber Count 👉 2,260 🎉 

We crushed our 2,000 goal over the weekend! Never would’ve believed we’d hit it that fast… can we hit 2,500 by month’s end?! 💥 🎊

Let’s keep the party going! Do you know any other normal people?!
Share this sign-up link with them today!

⚡️ RESOURCES ⚡️ 

💣️ WANT MORE ai FNP? 💣️ 

Follow me on X ← DO IT ✖️ 

Connect with me on My LinkedIn ← I WANT TO CONNECT 😎 

Follow our Instagram - BOOM!!! 🧨