Why does the compiler understand ASCII by default?

printf("%s", name);@piefed.blahaj.zone · edit-2 13 days ago


modeler@lemmy.world · 13 days ago

In C (and most other languages) you can write a value like thirty-two in several notations: ‘ ‘ as a char, 32 in decimal, 0x20 in hex, 040 in octal or 0b00100000 in binary (binary literals are standard only since C23, though GCC and Clang accept them as an extension). All of these mean thirty-two, just written differently using different characters.

Note that each of these representations uses different digits and/or letters to specify the same number: the mathematical entity thirty-two. These letters and digits are themselves characters stored as numbers, e.g. the digit zero (0) is coded in ASCII as 48.

In the end, the C compiler will compile the text ‘32’, ‘0x20’ or ‘0b00100000’ into binary and store it in memory as a single byte in which bit 5 (counting from 0 at the least significant end) is set to 1 and the others to 0. This is the way computer memory is used to represent the number, and it’s linked to the way the electric circuits do things like add numbers together.
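To make the point about notations concrete, here is a minimal sketch in C. The function names `all_spellings_equal` and `byte_to_bits` are made up for illustration, and the character constant only equals 32 on a system whose execution character set is ASCII-compatible.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 when every spelling below yields the same integer value. */
int all_spellings_equal(void)
{
    int decimal = 32;   /* base 10 */
    int hex     = 0x20; /* base 16: 2*16 + 0 */
    int octal   = 040;  /* base 8:  4*8 + 0 */
    int ch      = ' ';  /* character constant; 32 only on ASCII-compatible
                           systems (assumption) */
    /* C23, and GCC/Clang as an extension, would also accept 0b00100000. */
    return decimal == hex && hex == octal && octal == ch;
}

/* Writes the eight bits of value, most significant first, into out,
   followed by a terminating '\0'. out must have room for 9 chars. */
void byte_to_bits(unsigned char value, char *out)
{
    assert(out != NULL);
    for (int i = 0; i < 8; i++)
        out[i] = (value & (0x80u >> i)) ? '1' : '0';
    out[8] = '\0';
}
```

Calling `byte_to_bits(32, buf)` fills `buf` with `00100000`: a single 1 at bit 5, whichever notation was used in the source.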

The compiler is the thing that reads characters like ‘3’, ‘+’ or ‘F’ and works out what they represent (a number, a variable name or a sum). Once it understands that, it turns the code into bits that the CPU can work with, for example passing the value to the compare instruction CMP used to test whether the input is less than the number thirty-two.

This is very meta, and I hope this clarifies rather than confuses. Compilers are notoriously confusing, such as the thought ‘the compiler compiles the compiler code into an executable that can compile itself’.

CameronDev@programming.dev · 13 days ago

You are inspecting a byte, which contains 0x39. That it corresponds to “9” doesn’t matter to the compiler at all. If you were on a system where “9” did not map to 0x39, it would still replace those byte values.

printf("%s", name);@piefed.blahaj.zone · 13 days ago

Thanks! I’ll have to noodle this around a bit because it’s hard for me to understand. xD

CameronDev@programming.dev · 13 days ago

I dunno if this helps, but this screenshot shows the memory view of a program with a string. The hex representation in the middle is what is actually stored in memory. Each pair is one byte/char, and that is what your input[n] <= 0x39 is comparing against.
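In case the memory view is hard to picture, a rough sketch of the same idea in C follows. `is_ascii_digit_byte` and `dump_bytes` are hypothetical helper names, not anything from the original program; the range check mirrors the `input[n] <= 0x39` comparison mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* True when the byte lies in 0x30..0x39, the ASCII codes for '0'..'9'.
   This is a plain integer comparison; no character table is consulted. */
int is_ascii_digit_byte(unsigned char byte)
{
    return byte >= 0x30 && byte <= 0x39;
}

/* Prints each byte of s as two hex digits, similar to a debugger's
   memory view, and returns the number of bytes printed. */
size_t dump_bytes(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; s[n] != '\0'; n++)
        printf("%02X ", (unsigned char)s[n]);
    printf("\n");
    return n;
}
```

On an ASCII-compatible system, `dump_bytes("42")` prints `34 32`, which is what the program actually sees when it looks at `input[0]` and `input[1]`.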

kubica@fedia.io · 13 days ago

Maybe this is a bit too much if you are just starting, but read the first paragraph at least, and then also look at “Relationship to ASCII” and “Relationship to Unicode”.

https://en.wikipedia.org/wiki/Code_page

printf("%s", name);@piefed.blahaj.zone · 13 days ago

Thanks! Every piece of advice is appreciated! :D

Aaron “Abolish ICE” Madlon-Kay@mastodon.social · 13 days ago

@akunohana Apologies if I’m missing something, but 0x30, 0x39, and 0x18 are merely hex notation for integers. You could have written 48, 57, and 24 instead for the same effect. Or you (probably?) could have used char literals like ‘0’, ‘9’ (I guess there isn’t one for U+0018).

UTF-8 determines how the characters are encoded, i.e. what sequence of numbers (ints) they are represented by. So there’s no special understanding of Unicode or hex going on here; you’re comparing numbers (as you should).
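As a small illustration of "characters are sequences of numbers", the sketch below counts code points in a UTF-8 string by looking only at byte values. The name `utf8_code_points` is made up for this example, and the function assumes the input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Counts code points in a UTF-8 string by skipping continuation bytes,
   which always have the bit pattern 10xxxxxx.
   Assumption: s is valid UTF-8; no validation is attempted. */
size_t utf8_code_points(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;  /* a lead byte or a single-byte (ASCII) character */
    }
    return count;
}
```

For example, "é" is encoded in UTF-8 as the two bytes 0xC3 0xA9, so `utf8_code_points("\xC3\xA9")` returns 1 even though the string is two bytes long. Nothing here "understands" Unicode; it is all integer comparisons.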

printf("%s", name);@piefed.blahaj.zone · 13 days ago

LOL (please excuse my early 2000’s slang)

Thanks for that clarification.

What I still don’t quite understand is, does this mean that there is no “checking against an ASCII table” or “lookup” going on? But how then does it know that 0x18 is CAN (disregard)?

kubica@fedia.io · 13 days ago

Imagine you want to encipher a text. You can have a table with one column for the character you really mean and another column for its obscured representation. What you write is the obscured form, but when you read, you take out that table to translate it back to the original meaning. Computers do much the same thing when translating between bits and letters. But there are many such tables, and which one applies depends on the language being read and written.

Aaron “Abolish ICE” Madlon-Kay@mastodon.social · 13 days ago

@akunohana The compiler doesn’t need to know that 0x18 is CAN; that knowledge is embedded in whatever decided that the data you’re inspecting is UTF-8 or ASCII.

The content of your original post has been replaced with a link that I can’t open so I can’t go back and confirm where you said the data was coming from. But if the data was some other exotic encoding then 0x18 would mean something else in the context of that data.
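To illustrate that the "table" only exists if someone writes it down, here is a sketch with a hand-made lookup for a few ASCII control codes. `ascii_control_name` and `is_cancel_byte` are invented names, and the table is deliberately incomplete.

```c
#include <stddef.h>

/* Human-readable names for a handful of ASCII control codes.
   The compiler has no such table built in; this one exists only
   because it was typed out here. Returns NULL for anything else. */
const char *ascii_control_name(unsigned char byte)
{
    switch (byte) {
    case 0x00: return "NUL";
    case 0x09: return "HT";
    case 0x0A: return "LF";
    case 0x0D: return "CR";
    case 0x18: return "CAN";
    case 0x1B: return "ESC";
    case 0x7F: return "DEL";
    default:   return NULL;
    }
}

/* The check itself needs no table at all: it compares two integers. */
int is_cancel_byte(unsigned char byte)
{
    return byte == 0x18;
}
```

The comparison in `is_cancel_byte` works the same whatever the data "means"; calling 0x18 "CAN" is an interpretation that ASCII (and UTF-8, which agrees with ASCII in that range) layers on top.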

printf("%s", name);@piefed.blahaj.zone · 13 days ago

Oh, maybe I messed something up when editing… Here’s what I wrote:

I was experimenting with sanitizing user input. In other words, I’m simply prompting for user input and reading it into a character array with fgets.

What is “that thing” that decided the encoding?
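The original code is not shown in the thread, so the following is only a guess at what such a setup might look like: a line is read with `fgets` and control bytes are stripped by value. `strip_control_bytes` and `read_sanitized_line` are hypothetical names, and the choice of which bytes to drop is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Removes ASCII control bytes (0x01..0x1F and 0x7F) from s in place and
   returns the new length. Bytes >= 0x80 are kept untouched, so any
   multi-byte UTF-8 sequences pass through unchanged. */
size_t strip_control_bytes(char *s)
{
    assert(s != NULL);
    size_t out = 0;
    for (size_t in = 0; s[in] != '\0'; in++) {
        unsigned char b = (unsigned char)s[in];
        if (b >= 0x20 && b != 0x7F)
            s[out++] = s[in];
    }
    s[out] = '\0';
    return out;
}

/* Reads one line from stream into buf (at most size - 1 bytes) and
   sanitizes it. Returns 1 on success, 0 on end of file or error. */
int read_sanitized_line(char *buf, int size, FILE *stream)
{
    assert(buf != NULL && stream != NULL && size > 0);
    if (fgets(buf, size, stream) == NULL)
        return 0;
    strip_control_bytes(buf);  /* also drops the trailing '\n' (0x0A) */
    return 1;
}
```

A caller would typically use `read_sanitized_line(input, sizeof input, stdin)`. Whatever bytes arrive in `input` were chosen by the environment that delivered them, not by the compiler.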

Aaron “Abolish ICE” Madlon-Kay@mastodon.social · 13 days ago

@akunohana OK that’s a pretty good question then. In that case the encoding is determined by your terminal (or if not terminal then execution environment). Try invoking env (or locale) and looking at LANG and LC_ALL; those should tell you what your terminal accepts as input and passes along to your program.
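The same information can be queried from inside a program. This is a sketch assuming a POSIX system, since `nl_langinfo` comes from POSIX rather than ISO C; `environment_codeset` is an invented name. It reports what the environment variables claim, which is assumed (not guaranteed) to match what the terminal really sends.

```c
#include <langinfo.h>  /* nl_langinfo, CODESET: POSIX, not ISO C */
#include <locale.h>

/* Adopts the character-type locale from the environment (LC_ALL,
   LC_CTYPE, LANG) and returns the name of its character encoding. */
const char *environment_codeset(void)
{
    /* If the environment names a locale the system does not know,
       setlocale returns NULL and the default "C" locale stays active. */
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}
```

With `LANG=en_US.UTF-8` set and that locale installed, this would typically return "UTF-8"; the exact string for other locales varies between C libraries.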

printf("%s", name);@piefed.blahaj.zone · 13 days ago

Sweet! I think this is the answer that I was looking for, although my post is poorly phrased. 😅

Does this mean that, in theory, portability problems could arise? 98-ish percent of all systems use Unicode, but if I were to run my program on an obscure system whose underlying character encoding is not Unicode or some superset of ASCII, I assume it would return other values?

Aaron “Abolish ICE” Madlon-Kay@mastodon.social · 13 days ago

@akunohana Yes! Although I would say that ASCII is a pretty safe assumption, and it’s really anything above the top of ASCII that you need to account for (document it as a requirement for your program, or take steps to ensure the OS uses the right encoding if you are packaging something for distribution).
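Two small habits help with the portability concern, sketched here with invented names `is_decimal_digit` and `is_ascii_only`. The first relies on the C standard's guarantee that '0' through '9' have consecutive values in every execution character set; the second flags input that goes beyond 7-bit ASCII so the program can decide how to handle it.

```c
#include <assert.h>
#include <stddef.h>

/* Portable digit test: character constants take whatever values the
   target's character set assigns, and the C standard guarantees that
   '0'..'9' are contiguous. Hard-coding 0x30..0x39 assumes ASCII. */
int is_decimal_digit(char c)
{
    return c >= '0' && c <= '9';
}

/* True when every byte of s is below 0x80, i.e. within the ASCII range.
   Anything else depends on the encoding the environment uses. */
int is_ascii_only(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if ((unsigned char)*s >= 0x80)
            return 0;
    }
    return 1;
}
```

A program could call `is_ascii_only` on each input line and reject or specially handle anything that fails, which makes the "ASCII is assumed" requirement explicit rather than silent.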