Programming thread

C really can't be beat for embedded stuff. I was looking at the Rust runtime libraries for embedded devices and they looked really immature. If you put the C and the Rust side by side, the C is more readable: the C versions are cleaner and don't have strange macros and constructions. The way you acquire pins in Rust is also less direct; you first have to initialise the pin interface and then select the pin and mode you want, while in C you can just get the pin and mutate it. This is an example with Arduino:
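Roughly this kind of thing (a minimal sketch; pin 13 is just an arbitrary example):
C:
// Arduino-style C: grab a pin and mutate it directly.
const int LED_PIN = 13;   // arbitrary pin number for illustration

void setup() {
    pinMode(LED_PIN, OUTPUT);     // configure the pin's mode
}

void loop() {
    digitalWrite(LED_PIN, HIGH);  // drive the pin high
    delay(500);
    digitalWrite(LED_PIN, LOW);   // and low again
    delay(500);
}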

Arduino Code is its own C-style language, but isn't C.

All roads lead to Rome and all programs compile to C. Even though rustc is self-hosting, it still just links against the LLVM C bindings to do code generation. As long as code generation is done through C, we'll still have it around.
I'm not sure what exactly you meant by this but if you meant that code in other languages gets transcribed to C before being compiled, that is incorrect. Almost all compiled languages get compiled directly to machine code and any language which needs to be transcribed to C first would definitely be the exception, not the rule.
 
writing software for boeings
I bet I can guess which one.

I'm not sure how strlen is implemented
There's really only one way it could be implemented for a null-terminated string, optimizations aside.
I wouldn't call this a C problem though. In C you're perfectly free to write your own string implementation with whatever kind of length-tagging and bounds-checking you like. Microsoft uses length-counted strings in C all over the place, like BSTR in various COM and COM-adjacent places, and tagged strings in both 1 and 2-byte widths in Excel.
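Modulo the word-at-a-time tricks real libcs use, the textbook version is just this (my_strlen is an illustrative name):
C:
#include <stddef.h>

// Walk the bytes until the NUL terminator and return how many we passed.
size_t my_strlen(const char *s)
{
    const char *p = s;
    while (*p != '\0')
        ++p;
    return (size_t)(p - s);
}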
 
Like what @awoo said, it really is a C problem that when you iterate over a char array in C, arrays don't carry information about where their last item is. I'm not sure how strlen is implemented; maybe it is susceptible to this problem. The real problem is that C does not have foreach loops. You can't have a buffer overflow if your loop stops when you no longer have data to read. The design flaw is checking for something that might not be there, which by extension is insecure. The better practice is not to trust that your data has been created correctly; correctly being terminated with \0.

In the above example, we don't rely on the data telling us how to react to it; rather, we take the compiler-defined length of the data and use that to govern our operations on it. We can trust the compiler as much as we are willing to trust the binary created by the compiler.
One of the strengths of C is that it never adds extra complexity where a user doesn't want it. It really does give you nearly "atomic" units (not the concurrency idea). There are many, many situations where forcing an array to also maintain memory for its length or other operations is paying for something you don't need. Think of all the stack-only situations where you already have the constant representing the length at your disposal.

It's not a C problem that char[] behaves exactly like any other array. It's a problem when people view char[] as the higher level notion of "string" invoked by more modern languages and expect similar results. char[] is no more a string than int[] is a "vector" container.

The idea to create the lightest possible string by overloading the meaning of the zero byte can be useful in some situations, but it's not useful in many situations where the burden of maintenance outweighs the benefits it might provide. In these situations, one should craft a structure that better maintains the details requisite for ease of use.

A 'cstring' is merely a concept that was started and supported by the standard library, nothing more, nothing less. Anyone could make a library with their own cute terminating byte or set of terminating bytes. In most situations, today, when do people really use pure char[] cstrings? It requires that you really only ever need the ASCII or ANSI character set, and that the data you are receiving is that or conformant UTF-8 (god help you if a BOM is in there). It's not plausible in many modern contexts. Generally you have a smarter string that can perhaps decay into just the cstring array when you have a library that requires such a representation. Also, the smarter string ensures proper enforcement of the byte terminator.

Regardless, the point is that there is nothing inherently wrong with arrays in C. They are exactly the fundamental unit you need in order to build whatever more complex structure you need. If you are constantly passing around extra meta-information about the array, make a proper container. Don't just use raw arrays. std::vector is also a T[], just wrapped by an interface that makes it less error-prone. "For each", then, is possible with a C array, but you have to build it. And that's one reason why C / C++ are so good: if you don't like it, you can build your own variant and it's not any less performant than the library's offering (provided your implementation is sound).
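For what it's worth, a usable for-each over a plain stack array is a couple of macros (a rough sketch; ARRAY_COUNT and ARRAY_FOR_EACH are just illustrative names, and it only works where the array hasn't decayed to a pointer):
C:
#include <stdio.h>

// Element count of a real array (not valid once it has decayed to a pointer).
#define ARRAY_COUNT(a)  (sizeof(a) / sizeof((a)[0]))

// Iterate a pointer over every element of the array.
#define ARRAY_FOR_EACH(type, it, arr) \
    for (type *it = (arr); it != (arr) + ARRAY_COUNT(arr); ++it)

int main(void)
{
    int values[] = { 3, 1, 4, 1, 5 };

    ARRAY_FOR_EACH(int, v, values)
        printf("%d\n", *v);

    return 0;
}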

I'm not sure what exactly you meant by this but if you meant that code in other languages gets transcribed to C before being compiled, that is incorrect. Almost all compiled languages get compiled directly to machine code and any language which needs to be transcribed to C first would definitely be the exception, not the rule.

The reach of LLVM is actually growing to be quite large. Many languages are compiled to LLVM's intermediate representation, optimized there, then finalized through the various LLVM backends.
 
It'll be hard to find anyone that actually revels in the design of C strings, but you did say they caused billions of dollars of mistakes. No, they did not. Programmers caused those errors, not the language. C strings are error-prone if you are not careful, but they are not "hard" for experts. It's standard array operations and remembering to append the null terminator. If that's hard, then god help you.
I think that's a distinction without a difference. If a language has important features that are arbitrarily easier to fuck up, that's the language's fault.

Even the best programmer in the world is going to fuck up sometimes. If they've got a choice between two languages that are identical in every possible way except that one leans heavily on C strings and the other doesn't, then yeah, the language is at fault for that.
All roads lead to Rome and all programs compile to C.
That's not true. C's used in mainstream OSes not really because of any inherent advantages, but mostly as a historical accident.

x86 calling conventions are more important, along with things like libdl. What language the libraries are actually written in matters less. (In fact, it's very nice when interpreted languages come with libdl bindings that you can use from the interpreter. A little dangerous, sure, but it makes experimentation very enjoyable.)
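The C side of that is just dlopen/dlsym; a rough sketch (libm.so.6 and cos are arbitrary examples, and you may need to link with -ldl on older glibc):
C:
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    // Open a shared library by name and look a symbol up by string.
    void *handle = dlopen("libm.so.6", RTLD_LAZY);
    if (!handle) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return 1;
    }

    double (*cosine)(double) = (double (*)(double))dlsym(handle, "cos");
    if (cosine)
        printf("cos(0) = %f\n", cosine(0.0));

    dlclose(handle);
    return 0;
}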
C really can't be beat for embedded stuff.
There's a Forth (or some closely related derivative) written for basically every CPU out there. It's far more dynamic and capable than C, while being competitive efficiency-wise.
 
I think that's a distinction without a difference. If a language has important features that are arbitrarily easier to fuck up, that's the language's fault.

Even the best programmer in the world is going to fuck up sometimes. If they've got a choice between two languages, that are identical in every possible way except one leans heavily on C strings and the other doesn't, then yeah, the language is at fault for that.
I think it's just looking at the language in the wrong way. C / C++ provide extremely low level operations so that, when you need it, you can express it in the language. They also provide ways to avoid staying entirely in the lowest of operations that are error-prone (C++ does a much better job of this). I think it's more of a question of usage and programmer expectation than a language issue.

The simplest answer is merely to make a better string type. I think it should be noted that while many modern languages have their string type as a fundamental type, that's not the case in C or C++. Furthermore, it is not uncommon to make your own; std::string is not used nearly as frequently as you might think, and neither are "cstrings". The point is, when you can, make your own. C and C++ allow this and, in the modern era, expect it.

If you're just writing a program that needs to accomplish a few goals and high performance isn't an issue, why even bother using cstrings in the first place? Make more complex, expensive structs / classes. Or use another language entirely. Different languages serve to fill different roles. I won't fault Ruby for not letting me program drivers efficiently -- that wasn't what it was designed for.
 
Ok, maybe the question is: why did C have null-terminated string functions in its standard library at all? Could the standard library, which has to be implemented, have been done without the string functions? My guess is that it was considered sufficiently low level because even some assemblers have support for NUL-terminated char arrays. (Edit: I found a Joel on Software post that said it was precisely because the microprocessor had an ASCIZ type. It's a fun read: https://www.joelonsoftware.com/2001/12/11/back-to-basics/ ) Or would it be reasonable to have the standard library include length-prefixed string functions, where the length is stored as some implementation-defined type? Maybe the C design committee could've included a built-in simple length-prefixed string type. I did notice that C has a complex number type; I don't know how it's implemented, but it seems no less non-trivial than a simple string.

Also, it's curious to me that the standard library made a whole duplicate set of those functions for wide characters too, I guess for whoever really wanted them.
 
Arduino Code is its own C-style language, but isn't C.
https://www.nongnu.org/avr-libc/user-manual/modules.html Yes it is. It's not the same runtime, but it is C. By "C-style", do you mean C++ or something?
There's really only one way it could be implemented for a null-terminated string, optimizations aside.
Relying on \0 termination is what's wrong. It would be better to treat strings more like arrays and not rely on any data inside the string to define it.
C:
#include <stdio.h>
#include <string.h>
#define strlen2(s) sizeof(s)

int strlen1(const char *str)
{
  int x = 0;
  while (str[x] != '\0')
    {
      x++;
    }
  return x; /* count excludes the terminating '\0', matching strlen */
}
I'm assuming that sizeof can actually introspect the memory allocated and isn't just looping like in strlen1.
One of the strengths of C is that it never adds extra complexity where a user doesn't want it. It really does give you nearly "atomic" units (not the concurrency idea). There are many, many situations where forcing an array to also maintain memory for its length or other operations is paying for something you don't need. Think of all the stack-only situations where you already have the constant representing the length at your disposal.

It's not a C problem that char[] behaves exactly like any other array. It's a problem when people view char[] as the higher level notion of "string" invoked by more modern languages and expect similar results. char[] is no more a string than int[] is a "vector" container.

The idea to create the lightest possible string by overloading the meaning of the zero byte can be useful in some situations, but it's not useful in many situations where the burden of maintenance outweighs the benefits it might provide. In these situations, one should craft a structure that better maintains the details requisite for ease of use.

A 'cstring' is merely a concept that was started and supported by the standard library, nothing more, nothing less. Anyone could make a library with their own cute terminating byte or set of terminating bytes. In most situations, today, when do people really use pure char[] cstrings? It requires that you really only ever need the ASCII or ANSI character set, and that the data you are receiving is that or conformant UTF-8 (god help you if a BOM is in there). It's not plausible in many modern contexts. Generally you have a smarter string that can perhaps decay into just the cstring array when you have a library that requires such a representation. Also, the smarter string ensures proper enforcement of the byte terminator.

Regardless, the point is that there is nothing inherently wrong with arrays in C. They are exactly the fundamental unit you need in order to build whatever more complex structure you need. If you are constantly passing around extra meta-information about the array, make a proper container. Don't just use raw arrays. std::vector is also a T[], just wrapped by an interface that makes it less error-prone. "For each", then, is possible with a C array, but you have to build it. And that's one reason why C / C++ are so good: if you don't like it, you can build your own variant and it's not any less performant than the library's offering (provided your implementation is sound).
  • It is not a strength to have buffer overflows.
  • I propose that strings be treated more like arrays and not rely on the \0 terminator, so there is nothing to fail to check for. The length of the string would be whatever the compiler allocated; the compiler should tell you this, not the string itself.
  • Yes, the arrays are fine; it's the strings that are the problem. Even if strings are defined with double quotes, they should really act more like arrays and not be different. Automatically you get null termination; while dropping it would break things that need it, it would be better to just go off the length of the array.
  • The reason I bring up foreach loops is because they are a good idea. They let you iterate over the data, so the thing you are testing is data you actually hold, which matters if this is a parser. Overflows are caused by waiting for things that never happen, so you shouldn't have to check for things that could be maliciously constructed. With foreach this is implicit. I have no doubt that C could have foreach loops.

"All roads lead to Rome" is a European truism, figuratively in history and literally in that the transport map of Europe was largely shaped by the existing Roman roads. Setting aside the rare case where a program genuinely has to be compiled into C because a C compiler is the only thing available for the platform (usually something old), what I meant is that compilers lean on C and C++ for code generation. Your program is written in Language X, but your compiler is written in C, or at least the part that does code generation is. In that sense the program was always put through a C compiler somewhere, and all programs end up touching C the same way all roads in Europe touch Rome. It's figurative, of course, not literal. If it were literal, the weather forecast would be frogs with a 50% chance of locusts at the weekend.
 
In these situations, one should craft a structure that better maintains the details requisite for ease of use.

A 'cstring' is merely a concept that was started and supported by the standard library, nothing more, nothing less.
One should reinvent a better string, but then one probably has to reinvent standard library I/O functions too because they'll all expect a cstring. If they had made library functions that allow passing a separate length argument it would make it a lot easier to use Pascal-style strings when the benefit of storing the length outweighs the cost, and programmers might have been less inclined to always default to cstrings.
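Even today you can at least shuttle a counted string through stdio with the "%.*s" precision or fwrite, which both take an explicit length; a rough sketch (the pstr struct is purely illustrative):
C:
#include <stdio.h>

// Purely illustrative counted-string type.
struct pstr {
    size_t      len;
    const char *data;   // not required to be NUL-terminated within len
};

int main(void)
{
    struct pstr s = { 5, "hello world" };

    printf("%.*s\n", (int)s.len, s.data);   // precision caps how many bytes %s reads
    fwrite(s.data, 1, s.len, stdout);       // fwrite takes the length explicitly
    putchar('\n');
    return 0;
}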
 
I assumed that null-terminated char arrays were considered the norm in C, instead of a struct with the char array and a size_t length, because it would take a little less memory? (Don't forget that using a struct like that is setting yourself up for a whole new world of errors to do with setting the length to the wrong number.)
 
Or would it be reasonable to have the standard library include length-prefixed string functions, where the length is stored as some implementation-defined type?
Depends what you consider "reasonable".
Something like what you're imagining has been done as a library, at any rate, though I've never tried it:
http://bstring.sourceforge.net/

I'm assuming that sizeof can actually introspect the memory allocated and isn't just looping like in strlen1.
Why would you assume a language feature like sizeof should be able to "introspect" the memory? Should the language know what allocator (if any!) you are using and what its internal data structures look like?
Consider that in C you might just be using some stack space as a buffer, you might be using the malloc that came with your implementation of the CRT (which may not be the same as mine), you might be doing arena allocation and managing everything within one giant block yourself, you might be calling out to the operating system and using its heap allocator, or you might be going full Terry and just writing directly to memory locations because you control the entire address space.

The only time sizeof actually does what you want is when you've allocated a fixed-size buffer at compile time. But there's no "magic" there, it's just a shortcut so you don't have to remember the number of bytes yourself and hardcode it in.
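Concretely, that's the only case where the numbers line up (a small sketch; the 8-byte pointer size is of course target-dependent):
C:
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char buf[64];              // fixed-size buffer: the size is part of the type
    char *p = malloc(64);      // heap block: sizeof only ever sees the pointer

    printf("%zu\n", sizeof buf);   // prints 64
    printf("%zu\n", sizeof p);     // prints 8 on a typical 64-bit target

    free(p);
    return 0;
}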
 
I assumed that null-terminated char arrays were considered the norm in C, instead of a struct with the char array and a size_t length, because it would take a little less memory? (Don't forget that using a struct like that is setting yourself up for a whole new world of errors to do with setting the length to the wrong number.)
That's one benefit; another is that on x86 it's stupid-easy to find the length with a repne scasb and then rep movsb the bytes: the terminator means you never have to carry a separate count around.
 
  • It is not a strength to have buffer overflows.
(Again, you are talking about arrays and strings as modern languages do. In C an array is just a contiguous block of memory that consists of elements of the exact same type. Nothing more, nothing less.)

Yes, it's a strength when a language doesn't have any forced bounds checking by default. First of all, remember that var[n] is just another way of expressing *(var + n). It relies on the fact that arrays decay into pointers to their first element; regardless, it's just shorthand for some basic pointer math.

1) How are you going to have some language-wide "checking" system on pointer math? It's not possible, nor is it desirable.
2) What should happen when n exceeds the boundaries of var? Should n get clamped? Should the operation cancel? Should some higher level error be propagated? These are decisions that should be made by the programmer so they can choose the appropriate response given the context. There is no default cost that makes sense to pay, especially if there's no actual way to exceed the boundaries.
3) Why should the programmer expect or want to pay for hidden instructions every time when doing something as seemingly transparent and simple as pointer math?

Most languages these days enforce their own paradigm. In C# using the [] operator means calling into code using an exception framework. Furthermore, if you want to "handle" an out of bounds, you have to deal with exceptions. C lets you handle it yourself (or not handle it at all). There are many times when you are writing loops that cannot exceed the boundaries of an array. Why pay for any checking in any of those situations? Again, if your response is "I shouldn't have to make a decision about this..." then a language like C certainly isn't for you.
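As a sketch of what "handle it yourself" can look like, here's a hypothetical opt-in checked accessor (get_or is just an illustrative name); you pay for the branch only at the call sites where you decide it's worth it:
C:
#include <stdio.h>
#include <stddef.h>

// Opt-in bounds check: the caller picks the policy; here an out-of-range
// index just yields a caller-supplied fallback instead of raising anything.
static int get_or(const int *arr, size_t count, size_t i, int fallback)
{
    return (i < count) ? arr[i] : fallback;
}

int main(void)
{
    int data[4] = { 10, 20, 30, 40 };

    printf("%d\n", get_or(data, 4, 2, -1));   // 30
    printf("%d\n", get_or(data, 4, 9, -1));   // -1: handled, no exception machinery

    return 0;
}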

A buffer overflow happens when the programmer isn't careful; any modern language will also let you write an infinite loop if you aren't careful. No one considers that a design flaw (in fact, some might argue it's a great proof the language is Turing complete). The buffer overflows associated with cstrings are technically infinite loops, just stopped due to the inevitable accidental zero-byte encounter or an OS memory violation.

Ultimately, though, every language no matter how "safe" or "unsafe" is doing this kind of pointer math internally. It is absolutely required; and any language that wants to get decent performance has the ability to bypass bounds checking. The question I suppose you are trying to address is "should the programmer also get this choice?"

Finally, generally speaking, it's not like every C programmer is using GCC 4.x and writing their code in Notepad. The ideal way to deal with these systems now is to use tooling that helps find errors and also does bounds checking in development builds, then turns it off for production builds in all areas that aren't critical. (One of the major reasons buffer overflows are so prevalent is the strange popularity of C; it's a very difficult language to leverage properly across a project, and median-grade programmers should stay far, far away. Yet that's not been the case, historically.)

If they had made library functions that allow passing a separate length argument it would make it a lot easier to use Pascal-style strings when the benefit of storing the length outweighs the cost, and programmers might have been less inclined to always default to cstrings.
The "_s" suffixed variants of string functions introduced in C11 more or less cover this.
 
I'm assuming that sizeof can actually introspect the memory allocated and isn't just looping like in strlen1.
sizeof only returns the size of an actual array object: a constant-length array, or a variable-length array (one allocated on the stack with a runtime length, e.g. int a[n]). Applying it to a pointer to a dynamically allocated array just returns the size of the pointer. sizeof does not concern itself with strings; it only operates on types. The form that takes an expression does not actually evaluate the expression (variable-length arrays being the one exception), it merely uses the type of the expression.

Ask yourself, how would you actually go about implementing the special behavior for getting the length of a string through sizeof? Would you just get the size of the allocated memory block that the pointer is pointing at? What if the string was allocated from an arena allocator, or otherwise not directly by the standard method? It would return the size of the entire block, on the best of days at least, and even then that would be a useless quantity.

I assumed that null-terminated char arrays were considered the norm in C, instead of a struct with the char array and a size_t length, because it would take a little less memory? (Don't forget that using a struct like that is setting yourself up for a whole new world of errors to do with setting the length to the wrong number.)
You can do that pretty simply, like this:
C:
#include <stdlib.h>

typedef struct {
    int  length;
    char data[];  // C99 flexible array member (char data[0] is the nonstandard-but-common equivalent)
} string_t;

string_t* AllocString(int length) {
    // One allocation holds the header plus the characters and a trailing '\0'.
    string_t* str = malloc(sizeof(string_t) + length + 1);
    if (!str) return NULL;
    str->length = length;
    str->data[length] = '\0';
    return str;
}
 
sizeof is (mostly) a compile-time operator, so the best it could do is, if you define a non-malloc'd char array (char something[10]), tell you the total size of the array you defined. There is no way it would know what to do with the null terminator, etc., especially if you have a dynamically-allocated array that changes its size.
 
I assumed that null-terminated char arrays were considered the norm in C, instead of a struct with the char array and a size_t length, because it would take a little less memory?

In addition to what ConcernedAnon wrote, more complex string types are also a great opportunity to add customizable, automatic SSO (short-string optimization) support via a union; that's used frequently these days.
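Roughly this shape, as a sketch (the names and the 16-byte inline buffer are arbitrary choices):
C:
#include <stdlib.h>
#include <string.h>

// Short strings live in the inline buffer; anything longer spills to the heap.
typedef struct {
    size_t len;
    int    on_heap;
    union {
        char  small[16];   // arbitrary inline capacity, includes the '\0'
        char *heap;
    } s;
} sso_str;

sso_str sso_from(const char *src)
{
    sso_str out;
    out.len = strlen(src);
    out.on_heap = out.len + 1 > sizeof out.s.small;
    if (out.on_heap) {
        out.s.heap = malloc(out.len + 1);
        if (out.s.heap)
            memcpy(out.s.heap, src, out.len + 1);
    } else {
        memcpy(out.s.small, src, out.len + 1);
    }
    return out;
}

const char *sso_cstr(const sso_str *str)
{
    return str->on_heap ? str->s.heap : str->s.small;
}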
 
I think it's just looking at the language in the wrong way. C / C++ provide extremely low level operations so that, when you need it, you can express it in the language. They also provide ways to avoid staying entirely in the lowest of operations that are error-prone (C++ does a much better job of this). I think it's more of a question of usage and programmer expectation than a language issue.

The simplest answer is merely to make a better string type. I think it should be noted that while many modern languages have their string type as a fundamental type, that's not the case in C or C++. Furthermore, it is not uncommon to make your own; std::string is not used nearly as frequently as you might think, and neither are "cstrings". The point is, when you can, make your own. C and C++ allow this and, in the modern era, expect it.

If you're just writing a program that needs to accomplish a few goals and high performance isn't an issue, why even bother using cstrings in the first place? Make more complex, expensive structs / classes. Or use another language entirely. Different languages serve to fill different roles. I won't fault Ruby for not letting me program drivers efficiently -- that wasn't what it was designed for.
I think some people overestimate the contexts in which they should use C.

I think that if you weigh the risks of misallocating resources against the cost of just relying more on garbage collection or other automated systems (plus the cheapness of computer resources versus programmer time), going with C is a bad decision in most ordinary programming situations. C is something that needs to be justified.

There's an attitude I see in some programmers (not that I'm accusing you of it, but I see it) where they want to go with C specifically as kind of a macho thing. Or "I can do it better than the computer". Truth is, humans can't do it better than the computer on average, over a long enough time period.

It's like the programmer version of "the house always wins".

And I'm not saying that garbage collection is a magic solution. It's not, it's just a tool, like anything else. Ideally, for most uses you can leave the various GC settings at their defaults, but it's important to know how GCs work and how to tune them for specific workloads.

I think I'm basically saying that I'd much rather work with an experienced, talented programmer who prefers a managed language over C and knows how to tune the GC for corner cases, than with the same experienced, talented programmer who much prefers to do resource handling themselves. The cost of a leak in programmer time is just too high, most of the time.
 
I think some people overestimate the contexts in which they should use C.

I think that if you weigh the risks of misallocating resources against the cost of just relying more on garbage collection or other automated systems (plus the cheapness of computer resources versus programmer time), going with C is a bad decision in most ordinary programming situations. C is something that needs to be justified.

There's an attitude I see in some programmers (not that I'm accusing you of it, but I see it) where they want to go with C specifically as kind of a macho thing. Or "I can do it better than the computer". Truth is, humans can't do it better than the computer on average, over a long enough time period.
I completely agree. In the same way, I wouldn't advocate doing calculations by hand instead of using a calculator in most situations. However, I do defend C / C++ on their principle of freedom; that in some situations the programmer really does know better and should be able to do whatever they want. If you really think you have a better idea or solution, you can implement it. That's getting more and more rare in modern languages, and it's something that is very important.

Personally, I can't stand C and do whatever I can to avoid writing in it. But the reasons why I dislike it have nothing to do with the principle of "trust the programmer." Like any principle, it can be abused and misused, but I think that freedom is an extremely important strength of C. (And yes, unfounded C elitism is a very real thing.)
 
Any online discussion about C/C++ always turns into an argument over whether errors are a result of the language being badly designed or if the programmers are not being smart enough / diligent enough to use it properly.

Always.

It's never an argument about whether programs written in it have more / worse errors than programs written in other systems programming languages. Notice that?
 
I completely agree. In the same way, I wouldn't advocate doing calculations by hand instead of using a calculator in most situations. However, I do defend C / C++ on their principle of freedom; that in some situations the programmer really does know better and should be able to do whatever they want. If you really think you have a better idea or solution, you can implement it. That's getting more and more rare in modern languages, and it's something that is very important.
If we're talking about bounds checking, and other bugs that you expect to result in segfaults, I think it's worth bearing in mind that these bugs still don't crash your operating system (assuming you're writing userspace programs). Your operating system provides its own managed environment, with the help of the hardware memory management unit, so that it can recover when a process does something bad like this.

I bring this up because I expect the JVM and .NET to have exactly the same policy: if you do an out-of-bounds access of an array in Java or C#, I expect the virtual machine to recover. It shouldn't go down any more than the Linux or Windows box it's running on should. And so in the case of the JVM and .NET, I expect out-of-bounds errors to result in runtime exceptions.

If we're not on a VM, and my language is supposed to care about performance, then I expect it to give me access to memory without bounds checks. Take the very high-level, GC'd language Ocaml, for example:

Code:
utop # Array.unsafe_get [|1;2;3|] 1000000000;;
Segmentation fault (core dumped)
I sometimes use such functions (they're generally all prefixed with "unsafe_"). But I only do so after concluding there's no better way to implement what I need. And even then, only after profiling has shown that it makes a significant performance difference. If it barely affects performance, there's no way I'm using "unsafe" functions simply because I think I know better.

By all means, give me access to the footgun if I really need it. Let me keep them safely locked away for the rest of the time.
 
I assumed that null-terminated char arrays were considered the norm in C, instead of a struct with the char array and a size_t length, because it would take a little less memory? (Don't forget that using a struct like that is setting yourself up for a whole new world of errors to do with setting the length to the wrong number.)

Don't bet on it. I have to deal with lots of interfaces in real-time applications, and in about half of them the strings aren't null-terminated, and a few don't even tell you the length.
 