Back when I spent the night at a Swedish airport I spent most of the morning shooting the shit with a lady coming back home from some tropical place. She'd previously worked in a government position, and after clearing up some gender confusion, she told me that before 1990 the final digits in our personal identity numbers used to correspond to where you were born as well as your gender. Nowadays it's just your gender.
Our "person numbers" are essentially your date of birth combined with 4 digits, YYYYMMDD-XXXX.
Thus if you've a PID looking like 19890221-0271, you can infer that the person is a man born in Stockholm on the 21st of February 1989. This isn't a valid PID, however, as it doesn't pass the Luhn check.
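For the curious, that checksum is easy to verify yourself. A minimal sketch below, assuming the short ten-digit form (the century digits are dropped before the check, which is how the official checksum works); the function name is my own:

```python
def luhn_valid(pid: str) -> bool:
    """Check the Luhn checksum of a Swedish-style personal identity
    number in its short ten-digit form, e.g. "890221-0271"."""
    digits = [int(c) for c in pid if c.isdigit()]
    total = 0
    for i, d in enumerate(digits):
        # Double every other digit starting from the left,
        # then sum the digits of each product.
        d *= 2 if i % 2 == 0 else 1
        total += d - 9 if d > 9 else d
    return total % 10 == 0

print(luhn_valid("890221-0271"))  # False: the example above fails the check
print(luhn_valid("890221-0270"))  # True: changing the final digit fixes it
```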
Minor segue; trans people can and do get their PID changed to reflect them being man/woman rather than the gender assigned at birth. Non-binary people are unfortunately not represented.
Back in 1990 they changed the "place of birth" digits of the PID; those are now assigned randomly. Up until then there was also a range of digits reserved for people born outside of Sweden who had immigrated. Perhaps in the future they'll do away with the gendering and just have the numbers assigned randomly.
The vast majority probably do. For a parent or guardian to be useful in this sort of situation they need to take an active interest and forge a bond with their ward, and in this day and age I don't think that all who wish to do that have the ability to, and there'll be a decent chunk of people who simply don't care.
I've a parent who didn't really give a fuck. I ended up hitting up lots of random dudes, making a bid for some kind of emotional connection, and no one in my personal vicinity knew, cared, or cared to know. It was a terrible idea, but my story is hardly unique, I know a handful of people with very similar stories.
I'd like to preface by apologising, because this became a very lengthy comment. I've written a TL;DR at the bottom that I think carries the main point across; all the rest is a semi-technical, rather loosey-goosey rundown of how language models work. I just hope it's coherent enough for someone to understand what I'm trying to convey.
So without further ado.
A language model doesn't really train on text. It trains on what are called tokens. As you feed it training data, before it reaches the ML algorithm it goes through a tokeniser. Hugging Face has a functional browser-based example here.
A tokeniser essentially splits up the input characters (including whitespace, tabulators, carriage returns etc.) and assigns them numerical identifiers. This is done for the entire dataset before you train the model.
It could look something like this: while you read "strawberry" as its own thing, an LLM might get the input 1, 496, 675, 15717, 1
In essence instead of checking each character individually, you end up with a large dictionary of character groupings, with numerical equivalents, allowing you to do maths with them.
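A toy sketch of that dictionary idea, for illustration only — real tokenisers (BPE and friends) learn their character groupings from data, and both the split of "strawberry" and the ids below are invented to match the example above:

```python
# Toy vocabulary of character groupings; real tokenisers learn these
# from data, and the ids here are made up for illustration.
vocab = {"<s>": 1, "str": 496, "aw": 675, "berry": 15717}

def tokenise(pieces):
    """Map a pre-split list of character groupings to token ids."""
    return [vocab[p] for p in pieces]

ids = tokenise(["<s>", "str", "aw", "berry", "<s>"])
print(ids)  # [1, 496, 675, 15717, 1]
```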
Which is what you do. After the tokenisation the algorithm generates embeddings, which are essentially meant to capture the semantics of language. That is, tokens represent individual building blocks of language, and embeddings are what define the relationships between those tokens. These are stored in something called a tensor, which is in essence a multi-dimensional map. Just like how we map locations in 2D/3D space, machine learning algorithms map "concepts" in sometimes many hundred-dimensional maps.
The embeddings are how an LLM can infer that the words "conceal" and "hide" are related, and that the former is generally considered fancier than the latter. I can almost guarantee that if you were to ask an LLM to rephrase Jane stashed the goods behind the crapper in a fancier, more professional manner, it'd come up with something like Jane concealed the items in the bathroom.
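The "related words sit near each other" idea can be sketched with made-up vectors — real models learn embeddings with hundreds of dimensions during training, whereas the three-dimensional values below are hand-picked purely to illustrate the comparison:

```python
import math

# Hand-made 3-dimensional embeddings; real models learn these values
# during training, in far higher-dimensional spaces.
embeddings = {
    "conceal": [0.9, 0.8, 0.1],
    "hide":    [0.8, 0.7, 0.2],
    "banana":  [0.1, 0.0, 0.9],
}

def cosine(a, b):
    """Cosine similarity: close to 1.0 means 'pointing the same way'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine(embeddings["conceal"], embeddings["hide"]))    # close to 1
print(cosine(embeddings["conceal"], embeddings["banana"]))  # much lower
```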
This is in part what makes it so hard to glean information from a model, you can't just open up the weights and extract the original training data, it's been chunked, processed, and categorised, and what you end up with is just many different pointers to and between a (relative) few tokens.
For a very long time the context window of these models was very small, and as a result you ended up with outputs that weren't very related to one another. I'm sure you've seen those "Type 'I wish' and keep pressing the middle suggestion on your keyboard and see what you get" memes, and how they usually spiral off into nonsense.
That's where the transformer architecture (the T in GPT) came into play. In short, it allowed the models to have a larger "working memory" and thus they could retain and extend that semantic context further. They could build more advanced networks of relationships and it's the source of the current "AI" craze. The models started inferring more distant relationships with words, which is what has given rise to this illusion of intelligence.
Once you have a model trained it's very hard to modify it. You can train auxiliary models to kind of bias the model in various directions. You can write system prompts to try and coax the model into a certain kind of output, but since it isn't actually a thinking thing, it can still go off script. You can do a sort of reverse engineering, toggling on and off certain neurons in the model to see how one concept might relate to another, though just like with regular brains a single neuron doesn't typically handle a single thing, and so this is a very time-consuming task.
In the end, the model you train is entirely deterministic, because it's all mathematics. Computers are by their very nature deterministic. The model you train isn't intelligent, and given a particular input it will always produce the same output.
If you've played Minecraft you're probably familiar with the concept of seeds. Just like an LLM, Minecraft's world generation algorithm is deterministic, and if you provide a particular seed value for the randomiser, it will always produce the same world. If you don't input a seed value the game generates a random value and uses that, which is why whenever you start a new world you'll always end up with something new.
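The seed analogy is easy to demonstrate with any pseudo-random number generator — two generators given the same seed produce the same "random" sequence, just like two Minecraft worlds with the same seed:

```python
import random

# Two generators seeded with the same value produce identical
# "random" sequences -- the same principle as a Minecraft world seed.
a = random.Random(42)
b = random.Random(42)

seq_a = [a.randint(0, 9) for _ in range(5)]
seq_b = [b.randint(0, 9) for _ in range(5)]
print(seq_a == seq_b)  # True: same seed, same world
```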
That's basically what LLMs do too. When selecting words to continue the given input, the model uses a process called stochastic sampling. In essence, for each input it gets a bunch of probable tokens that might follow, organises these into a probability distribution shaped by a variable called temperature, and then selects a token from that distribution.
The temperature value essentially controls how randomly it can select words. The lower the temperature setting is the more curved the distribution gets. With a really low temperature setting the deterministic nature of the model shines through. As the temperature increases, the curve flattens and more random tokens might get selected.
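A minimal sketch of that sharpening/flattening effect, using a plain softmax — the candidate tokens and their scores are invented for illustration:

```python
import math
import random

def softmax(logits, temperature):
    """Turn raw scores into a probability distribution; dividing by the
    temperature sharpens (low T) or flattens (high T) the curve."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for candidate tokens following "Once upon a":
candidates = ["time", "midnight", "mattress"]
logits = [5.0, 3.0, 1.0]

cold = softmax(logits, temperature=0.5)
hot = softmax(logits, temperature=5.0)
print(max(cold))  # the top token dominates at low temperature
print(max(hot))   # the distribution flattens as temperature rises

# Sampling then just draws one token from the distribution:
choice = random.choices(candidates, weights=cold, k=1)[0]
```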
At this point, the big "AI" companies have basically sucked the data well dry. They're trying to find more ways of making more data to train on, because what gave them the biggest, most remarkable progress in the past was increasing the quantity of training data. More and more LLM generated text is making it into these models, and existing patterns get reinforced.
TL;DR
I've written this entire comment myself. It is in a sense a mirror of me as a person; the way I punctuate things, the words I choose, the structure in which I've decided to describe things. You can infer bits and pieces about me from it; I've obviously had an interest in machine learning for a while, given the markdown usage I'm perhaps a bit more technically inclined, I might not be an English native, but I've a preference for British English.
Now, I could feed this entire comment through an LLM and you'd get a coherent output. It'd likely change my verbiage, fix the way I punctuate things, perhaps restructure things and make the text overall neater.
However, anything that was me in this text would be lost. There'd no longer be a person to infer anything about. No choices were made in the process of outputting the text. There is no inherent preference on anything because it's all just normalised pseudo-random output from a weighted probability matrix based on a corpus of as much text as whoever trained the LLM could get their hands on, be that legally or otherwise.
That is, I think, essentially what the article is talking about.
Aye. I've been on Tumbleweed on my main desktop for ~2 years at this point. It's really stable. There have been some smaller things I've troubleshot myself. For example, at some point GDM changed its monitor settings, so on the login screen I'd have a terribly low refresh rate, and when logging in my screen would flash black. I had no idea what exactly was the culprit, but with some digging I found out how to fix that. This here gave me the fix.
Other than that, literally the only problems I've ever had have been because NVidia has gone and fucked something up with their drivers. That's happened a handful of times, but I wouldn't put that blame on the distro.
Snapper is such a fantastic tool. Regardless of what distro one uses I'd highly recommend snapper. It comes baked into Tumbleweed, and I manually configured it on my Arch laptop.
What? It doesn't give opinions. It doesn't have opinions, it's a probability matrix.
Put very simply, when it's "trained" it takes a lot of data as input, and out of that it builds a sort of "map" of relationships within the input data. When then given an input it can look that input up in the "map" and give you the most probable following sequence.
Put simply, if you give it "Once upon..." it'll likely continue with "a time there was" because that's a pattern it'll have seen a lot.
It's an evolution of the autocorrect we have in phone keyboards. It's more advanced, and by adjusting the weights of different neurons you can tweak the output. You can also train a model further to fit it to a specific purpose, but ultimately it's just fancy autocomplete.
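The "fancy autocomplete" idea can be sketched as a tiny bigram model — a real LLM works on tokens through a transformer rather than whole words through a lookup table, but the count-and-continue principle is similar. The corpus below is made up:

```python
from collections import Counter, defaultdict

# A tiny training corpus; a real model sees terabytes of text.
corpus = "once upon a time there was once upon a time there lived".split()

# Count which word follows which -- a bigram "map" of the data.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def most_probable_next(word):
    """Return the continuation seen most often in training."""
    return follows[word].most_common(1)[0][0]

print(most_probable_next("upon"))  # a
print(most_probable_next("a"))     # time
```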
I’m unsure how much testing is done on Cachy. I’m on Tumbleweed, which is a rolling release with a focus on stability.
There isn’t much point in waiting to apply updates because new builds roll in fairly frequently. It’s not always the same packages of course, but most rolling release distros are on the bleeding edge, it’s kind of the point.
I update a couple of times a month, around every 7-14 days. You want to avoid letting it go for too long, because as changes accumulate the risk of more complicated conflicts and breakages increases.
Trump isn’t the problem. Lop his head off and you’ll have achieved catharsis but it won’t be a solution to any problems. A lot more heads need to roll before anything is solved.
You can’t vote your way out of it because the system is built for those with money. You need to change the entire system and no one who is currently benefiting from that system will aid you in that.
The Ribbon interface is terrible, though. The styles selector doesn't fit the entire button, and it also doesn't resize with your window size, remaining super tiny and incapable of displaying three full options simultaneously.
Word at least got that right.
My preferred layout is Sidebar, but even there the style is just a regular dropdown. LibreOffice is fantastic, but they need to put some more work into UX.