
Re: Is There A Better Hard Drive Metaphor?

April 4, 2022

A response to Marginalia's “Is There A Better Hard Drive Metaphor?”

Also see JBanana's response

I sorta disagree with the characterization of "[computers] have just a processor, RAM and other_devices[]" as being incorrect. If you're designing a language that's intended to be used outside of the standard desktop computer, you'll likely encounter setups with nearly any combination of peripheral devices, up to and including none at all.

In fact, I would say that a CPU, some minimum amount of RAM, and the supporting circuitry for those two is literally the only hardware that you know is present in a computer. With modern cloud and container-based computing, I'd say that software running without block devices (or with read-only ones) is even more common now than it was in the '70s–'90s, when many contemporary OSes and programming languages were being initially designed. And for most of the I/O devices that may be present on a system, it turns out that just dumping bytes at them and reading bytes from them is the lowest common denominator.

It should be noted that I'm talking about lower-level languages that are able to be used in bare-metal environments and are designed to run on literally as many systems as is feasibly possible (remember that Linux still supports m68k, and that any novel and/or obscure computer architecture or hardware configuration is de facto nonviable if there's no C implementation for it). For scripting languages meant to run in a hosted environment with a lot of abstraction around them, there isn't as good an excuse, other than "it's too much work to implement the block device abstractions ourselves, since the lower-level languages and the OS's facilities don't do it for us." Most interpreted/scripting languages already let you disable features at compile time, so for devices without normal hard drives those capabilities could always be augmented or removed.

languages that get it mostly right

There are two languages that come to mind that make dealing with disks nice, each for very different reasons.

The first is Forth, in particular the old-style blocks. It is the bare minimum required to access a block device: all you need to do is be able to read 1024 bytes into a RAM buffer, and then write those back if they're modified. Suitable for pretty much any block device setup you could cobble together (in fact CollapseOS uses blocks because they're so simple and so widely applicable, while still being efficient and fast).
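In Forth proper this is the BLOCK, UPDATE, and FLUSH words: BLOCK n hands you the address of a 1 KiB buffer holding block n, UPDATE marks that buffer dirty, and FLUSH writes dirty buffers back out. Here's a rough sketch of the same idea in Ada, using Ada.Direct_IO over an ordinary file standing in for a real block device (the file name and all the names here are made up for illustration):

    --  Sketch of the Forth block idea: a "disk" is just an array of
    --  1024-byte blocks that get read into RAM, modified, and written back.
    with Ada.Direct_IO;

    procedure Block_Demo is
       type Block is array (1 .. 1024) of Character;
       package Block_IO is new Ada.Direct_IO (Block);
       use Block_IO;

       F   : File_Type;
       Buf : Block;
    begin
       Open (F, Inout_File, "disk.img");
       Read (F, Buf, From => 1);   --  load block #1 into a RAM buffer
       Buf (1) := '!';             --  modify it in memory
       Write (F, Buf, To => 1);    --  write the dirty buffer back out
       Close (F);
    end Block_Demo;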

At the other end of the spectrum is Ada. Ada is at the extreme of the issues above because it was originally commissioned by the Department of Defense for all of their software, so it was tasked with running on everything from mainframes to spy satellites to secretaries' desktop computers. As such, it largely falls back to the standard reading and writing of buffers of bytes for its I/O idioms.

However, it has a neat trick up its sleeve to make I/O really nice to use: built-in attributes that allow you to write any type to a stream in a portable way. This transforms the boring buffer-based I/O into a very powerful and simple mechanism that feels as native as everything else in Ada. Accessing a file is the exact same process as accessing a TLS connection, which is the exact same process as pretty much any other I/O task you could think up.
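For instance, here's a toy sketch of dumping a record to a file stream and reading it back with nothing but the built-in attributes (the record and file name are made up):

    --  Toy example of Ada's stream attributes: any type can be written to
    --  and read back from any stream with 'Write and 'Read.
    with Ada.Streams.Stream_IO; use Ada.Streams.Stream_IO;

    procedure Stream_Demo is
       type Point is record
          X, Y : Integer;
       end record;

       F : File_Type;
       P : constant Point := (X => 1, Y => 2);
       Q : Point;
    begin
       Create (F, Out_File, "point.dat");
       Point'Write (Stream (F), P);   --  dump the record to the stream
       Close (F);

       Open (F, In_File, "point.dat");
       Point'Read (Stream (F), Q);    --  read it back the exact same way
       Close (F);
    end Stream_Demo;

Swap the file stream for a socket stream or a TLS stream and the 'Write and 'Read lines don't change at all.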

As an example, for my TLS library for Ada, I wrote little helper routines to read up to a delimiter, similar to C's getdelim(3).

https://git.sr.ht/~nytpu/tlsada/tree/master/src/tls.ads#L26-59

However, due to the genericity of Ada's streams, those same functions will work for everything that uses streams, without any modification.
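A hypothetical, stripped-down version of such a helper (not the actual tlsada code linked above) might look like the following; since it only takes an access to Root_Stream_Type'Class, the same function works on file streams, socket streams, TLS streams, and anything else:

    --  Hypothetical getdelim(3)-style helper over any Ada stream;
    --  not the actual tlsada implementation.
    with Ada.Streams;           use Ada.Streams;
    with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;

    function Get_Until
      (Stream    : not null access Root_Stream_Type'Class;
       Delimiter : Character) return String
    is
       Result : Unbounded_String;
       C      : Character;
    begin
       loop
          --  Pulls one character from whatever backs the stream;
          --  propagates End_Error if the stream ends first.
          Character'Read (Stream, C);
          exit when C = Delimiter;
          Append (Result, C);
       end loop;
       return To_String (Result);
    end Get_Until;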

The only differences are how you instantiate the stream in the first place, as naturally that changes depending on what type of device you're actually dealing with. With Ada's representation clauses, you can do I/O for arbitrary protocols and formats without even having to convert the binary stream to the program's internal representation, because the internal representation is identical to the protocol's! There is also a slight incompatibility because some streams are random-access and allow you to seek to different positions, but if you always treat a stream as a linear-access stream then they're all identical.
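As a small made-up example of what a representation clause looks like, a fixed four-byte protocol header can be declared so that its in-memory layout is nailed down to exactly match the wire format (modulo byte order):

    --  A made-up four-byte "protocol header" whose layout is pinned down
    --  with a record representation clause, so no separate
    --  (de)serialization step is needed.
    with Interfaces; use Interfaces;

    package Wire_Format is
       type Header is record
          Length  : Unsigned_16;
          Version : Unsigned_8;
          Flags   : Unsigned_8;
       end record;

       for Header use record
          Length  at 0 range 0 .. 15;
          Version at 2 range 0 .. 7;
          Flags   at 3 range 0 .. 7;
       end record;
       for Header'Size use 32;
    end Wire_Format;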

When programming in Ada, it's arguably harder to work with dynamic memory allocation than it is to do general I/O: Ada is memory safe (unless you really cajole it), so there are a lot of hoops to jump through when safely dealing with the heap.

As an aside, in case it wasn't obvious how generic to every computer Ada tries to make itself: the standard library package Ada.Directories.Hierarchical_File_Names is optional to implement, in order to accommodate systems that do not have hierarchical file systems. The package Ada.Directories, which supports operations common to both hierarchical and non-hierarchical filesystems, is itself also optional, for systems that don't have true filesystems at all. Most of the Ada standard library is optional, to accommodate every computer system possible, from the lowest-power embedded environments to supercomputers.

In case you can't tell, Ada is one of my favorite languages ;)

Addendum

Marginalia and I had a pretty good email conversation discussing this further:

From: vlofgren 
To: alex@nytpu.com

My argument, which I perhaps didn't make all too well, is that reducing a hard drive to a linear stream of data isn't letting you make use of the hardware.

If you want to store and quickly randomly access a lot of data, then you need a non-trivial data structure on disk. Something like a B-tree (or B+). Any language can implement such a structure, and it's amazingly fast and space efficient if you do, but it's an unimaginable pain in the ass to build these types of data structures with a programming language/operating system that only offer a tape drive simulator as means of interacting with the hardware.

Memory mapping helps but it's still severe unga bunga programming.

Modern programming languages offer virtually no tools for expressing disk-based data structures, which means for the most part, severely crippling what sort of disk-based data structures we can implement to just a long list of serialized objects.

If the data is larger than the system RAM (which is fairly sensible; after all, being able to handle such volumes is half the motivation behind having a hard drive), you can't simply mirror the data structure in memory. If the data is that big, you can't treat the hard drive as a long list of serialized objects.

Almost regardless of language, you're reduced to hideously gnarly explicit pointer manipulation, with zero safety guarantees, or type systems, or any sort of higher abstraction, really without anything to distinguish what you are doing from C (which is honestly more than a bit unfair to C, it provides in comparison pretty good tools for structured and safe memory access).

You can of course use a B-tree someone else has already built using these awkward tools (like any DBMS or file system). But that comes with significant overhead that may mean you need far more hardware than you would if you were able to build something suitable to your particular needs.

/ V

From: Alex // nytpu 
To: vlofgren 

Hi!

On 2022-04-05 12:26AM, vlofgren wrote:

My argument, which I perhaps didn't make all too well, is that reducing a hard drive to a linear stream of data isn't letting you make use of the hardware.

Oh yeah, I got that out of it. I was definitely not too clear in my post (actually, looking now, I forgot to write those paragraphs :|), but I was trying to say that most OSes and standard libraries seem to try and shove all I/O into the same serial stream paradigm, for consistency I guess, but I agree it could be better.

Since you keep mentioning C I'll point out that (when using POSIX's file descriptor interface) basic reading and writing to a terminal, file, pipe, Unix socket, TCP socket, etc. is the exact same, which is great for consistency and being generic---"gimme a file descriptor for literally anything and I'll dump my output to it!" but also cuts out opportunity to use the protocol-specific operations and features that one would want.

Honestly it's a massive trade-off. I'd personally love to see two different interfaces, one interface that exposes all the fun stuff that each individual device/connection supports, and a second wrapper interface that uses the device-specific ones to do generic, stream-like linear I/O that is the interface we're currently stuck with. Then you'd still be able to do the simplistic "I have some text I need to output, I'll send it anywhere you want" with the generic interface, but when you do need to deal with massive, complex on-disk data structures you can use the disk-specific interface. Alternately keep the current serial interface for backwards-compatibility but include an additional "fancy" interface that's separate from the current interface.

If you want to store and quickly randomly access a lot of data, then you need a non-trivial data structure on disk. Something like a B-tree (or B+). Any language can implement such a structure, and it's amazingly fast and space efficient if you do, but it's an unimaginable pain in the ass to build these types of data structures with a programming language/operating system that only offer a tape drive simulator as means of interacting with the hardware.



You can of course use a B-tree someone else has already built using these awkward tools (like any DBMS or file system). But that comes with significant overhead that may mean you need far more hardware than you would if you were able to build something suitable to your particular needs.

For some reason, at some point everyone decided to use a completely fucked level of abstraction for disk drives (and TLS for similar reasons to disk drives; see the next paragraph). Instead of delegating the stuff to the operating system or even the drive firmware, they decided that it should all be implemented in the library-level abstraction layer. "Just use SQLite!"

IMO, this and many other problems with computers come down to every single piece of a modern computer and operating system being bogged down with sixty years of historical baggage. They're obviously not backwards-compatible that far now, but they're always compatible with something slightly older, and that older thing is compatible with something older than it, and on and on until we're dealing with the exact same I/O interface that was used to access reel-to-reel tapes in 1964.

We almost could've been saved by Multics (1969): it not only invented the "everything is a file" metaphor but took it to such an extreme that there is no discernible difference between RAM and disk in terms of I/O. Of course, due to design limitations you couldn't have any "file" be larger than ~1 MiB (256K 36-bit words), which explains why Unix didn't adopt that design when co-opting almost everything else Multics did. If only they'd waited until 64-bit systems that could handle memory-mapping files whose sizes are literally inconceivable to humans!

Modern programming languages offer virtually no tools for expressing disk-based data structures, which means for the most part, severely crippling what sort of disk based data structures we can implement to just a long list of serialized objects.

This is what I sorta tried to address in the first section of my post, that higher-level languages should have these capabilities regardless of whether the OS supports them natively and regardless of whether the low-level language used for the compiler/interpreter supports it. It shouldn't be Postgres' job to wrap the existing shitty interface in a way that's efficient, it should be the language's.

* * *

Thanks for your reply! And sorry for the sorta tangential rambling, both your original post and reply were very thought-provoking in ways I didn't expect. I've been on a "I want to redesign everything, but better this time" kick for a while now though, so maybe it's just that :P

~nytpu

P.S. is it okay if I publish your response and this reply as an addendum to my post?

--
Alex // nytpu
alex@nytpu.com
gpg --locate-external-key alex@nytpu.com

From: vlofgren
To: Alex // nytpu 

Yeah, I think we're sort of on the same page then. Do publish if you want :-)