Sunday, 02 March 2025
Now I know my XYZ’s
When dealing with colour spaces, one eventually encounters the XYZ colour space. It is a mathematical model that maps any visible colour into a triple of X, Y and Z coordinates. Defined in 1931, it's nearly a century old and serves as a foundation upon which other colour spaces are built. However, XYZ has one aspect that can easily confuse programmers.
You implement a conversion function and, to check it, compare its results with an existing implementation. You search for an online converter, only to realise that the coordinates you obtain differ by two orders of magnitude. Do not despair! If the ratio is exactly 1:100, your implementation is probably correct.
This is because the XYZ colour space can use an arbitrary scale. For example, the Y component corresponds to the colour’s luminance, but nothing specifies whether the maximum is 1, 100 or another value. I typically use 1, such that the D65 illuminant (i.e. sRGB’s white colour) has coordinates (0.95, 1, 1.089), but a different implementation could report them as (95, 100, 108.9). (Notice that all components are scaled by the same factor.)
This is similar to sRGB. In the 24-bit True Colour representation, each component is an integer in the 0–255 range. However, a 15-bit High Colour representation uses the 0–31 range, Rec. 709 uses the 16–235 range and high-depth standards might use the 0–1023 range.
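As a rough sketch of such re-scaling (the helper names are mine, not from any library), converting a full-range 8-bit component to the 0–1 or 0–1023 scale is a plain rescaling; limited-range encodings such as Rec. 709’s 16–235 additionally involve an offset, which this sketch deliberately ignores:

// to01 maps a full-range 8-bit component onto the 0–1 scale.
func to01(c uint8) float64 { return float64(c) / 255 }

// to10bit maps a full-range 8-bit component onto the 0–1023
// scale, rounding to the nearest integer.
func to10bit(c uint8) uint16 { return uint16((uint32(c)*1023 + 127) / 255) }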
A closely related colour space is xyY. Its Y coordinate is the same as in XYZ and can scale arbitrarily, but x and y have well-defined ranges. They define chromaticity, i.e. hue, and can be calculated using the following formulæ: x = X / (X + Y + Z) and y = Y / (X + Y + Z). Both fall within the [0, 1) range.
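As a minimal sketch (the function name is mine, not from any particular library), the conversion from XYZ to xyY is just those two divisions, and the result does not depend on the scale used for XYZ:

// xyYFromXYZ converts XYZ coordinates into the xyY colour space
// using x = X / (X + Y + Z) and y = Y / (X + Y + Z).  The scale
// of the input does not matter: (95, 100, 108.9) and
// (0.95, 1, 1.089) yield the same chromaticity (x, y).
func xyYFromXYZ(X, Y, Z float64) (x, y, luma float64) {
    sum := X + Y + Z
    if sum == 0 {
        // Pure black has no well-defined chromaticity.
        return 0, 0, 0
    }
    return X / sum, Y / sum, Y
}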
Sunday, 23 February 2025
Regular expressions aren’t broken after all
Four years ago I proclaimed that regular expressions were broken. Two years ago I discussed this with BurntSushi and even though his expertise in the subject could not be denied, he did not manage to change my opinion. But now, two more years later, I have adjusted my stance.
Everything factual I’ve written previously is still accurate, but calling regular expressions broken might have been a bit of a hyperbole. There’s definitely something funky going on with regex engines, but I’ve found an analogy which makes it make sense.
Recap
In formal language theory, alternation, i.e. the | operator, is commutative. For two grammars α and β, α|β and β|α define the same language just like 1 + 2 and 2 + 1 equal the same number (there are no two different 3s depending on how they were constructed).
Nevertheless, most regex engines care about the order of arguments in an alternation. As demonstrated in my previous post, when matching the string ‘foobar’ against the foo|foobar regular expression, regex engines will match the ‘foo’ substring, but when matching it against foobar|foo they will match the entire ‘foobar’ string.
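For instance, Go’s regexp package documents leftmost-first alternation semantics, so a quick check (my sketch, not from the original post) reproduces the behaviour:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    s := "foobar"
    // Leftmost-first semantics: among matches starting at the
    // leftmost position, the first alternative wins.
    fmt.Println(regexp.MustCompile(`foo|foobar`).FindString(s)) // foo
    fmt.Println(regexp.MustCompile(`foobar|foo`).FindString(s)) // foobar
}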
This is a bit like saying that 5x should give different results depending on whether x was constructed as x = 1 + 2 or x = 2 + 1. Of course software engineering and maths are different disciplines and things don’t directly translate between the two. Nevertheless, I felt justified in calling such regex engines broken.
Prior Art
I adjusted my stance on the issue thanks to other examples where programming practice clashes with its theoretical roots. Below I’ll give a handful of examples, culminating with the one that really changed my position.
String concatenation
Many languages use a plus symbol as a string concatenation operator, which isn’t commutative. Meanwhile in maths, plus is by convention used for commutative operations only.1 Indeed, some languages opt for using different concatenation operators: D uses ~, Haskell uses ++, Perl uses . (dot), SQL uses || and Visual Basic uses & to name a few examples.2
However, this is a different situation than the case of alternation in regular expressions. Using the plus symbol for concatenation may be considered unfortunate, but the operation itself behaves the same way its counterpart in maths does.
Floating point numbers
Another example where maths disagrees with programming is floating point numbers. They pretend to be real numbers but in reality they aren’t even good at being rational numbers. Most notably for this discussion, addition of floating point numbers is not associative and can lead to catastrophic cancellation.3 Plus there are NaN values which infamously do not equal themselves and can really mess up array sorting if not handled properly.
However, this didn’t convince me that regular expressions weren’t broken either. After all, I’m perfectly happy to call floating point numbers broken. I don’t mean by this that they are unusable or don’t solve real (no pun intended) problems. Rather, this is only to emphasise that there are many details that an engineer needs to be wary of when using them. This is the same sense in which I used the word in regard to regex engines.
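For illustration, a small Go program shows both quirks (nothing Go-specific here; any IEEE 754 implementation behaves the same, the variables merely stop Go from folding the constants exactly):

package main

import (
    "fmt"
    "math"
)

func main() {
    // Addition is not associative: grouping changes the result.
    a, b, c := 0.1, 0.2, 0.3
    fmt.Println((a+b)+c, a+(b+c))        // 0.6000000000000001 0.6
    fmt.Println((a+b)+c == a+(b+c))      // false

    // NaN does not equal itself.
    nan := math.NaN()
    fmt.Println(nan == nan)              // false
}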
Logic operations
In the end, what made me more open to the ‘broken’ behaviour of regular expressions were logic operators. In many (most? all?) imperative languages, the logical or operator uses short-circuit evaluation. For example, while the puts("foo") || puts("bar") and puts("bar") || puts("foo") C expressions evaluate to the same value (one), their behaviours differ — the first one outputs ‘foo’ and the second one outputs ‘bar’ to standard output.
This is analogous to regular expressions. When matching against the foo|foobar and foobar|foo regular expressions, the result (whether the string matches) is the same, but the side effects of the execution may differ.
Conclusion
To be clear, I still have doubts whether foo|foobar and foobar|foo behaving differently is the right option. However, it’s also fair to say that it’s not as broken as I used to think; rather, it’s one of the peculiarities of regexes that one needs to be aware of. And specifically, aware of how the regex engine they use behaves.
Friday, 21 February 2025
Privilege Separation in Go
Almost three weeks ago, I gave a talk on privilege separation in the Go programming language at FOSDEM 2025. In my talk, I was foolish enough to announce two blog posts, one I had already written and this one. After a few evenings where I found the time to work on this post, it is finally done.
The previous Dropping Privileges in Go post dealt with the privileges a computer program has. Usually, these privileges are derived from the executing user. If your user can read emails, a program executed by this user can do so as well. Then I showed certain techniques for giving up privileges on POSIX-like operating systems.
But is just limiting all privileges enough? Unfortunately, no, because there are software projects that deal with sensitive data and at the same time contain dangerous code paths. Consider an Internet-facing application that handles user credentials over a spooky committee-born protocol, perhaps even parsed by a library notorious for security opportunities.
It is possible to split this application apart, resulting in one restricted part that deals with authentication, an even more restricted part handling the dangerous parser, and some communication in between. And, honestly, that is the gist of privilege separation. But since this has been a very superficial introduction, more details, specific to POSIX-leaning operating systems and the Go programming language, will follow.
Changes In Software Architecture
It might be a good idea to do some preliminary thinking before you start, to identify both the parts into which you want to divide the software, and the permissions that will be required throughout the life of those parts.
For example, a web application that manages its state in a SQLite database requires both network permissions (opening a port for incoming HTTP connections) and file system permissions (SQLite database). One could implement one subprocess for each task and would end up with a supervising monitor process, a web server subprocess (network permissions), and a SQLite subprocess (file system permissions).
Taking the example from this theoretical level to the POSIX concepts introduced in my previous post, the supervising monitor process could launch two subprocesses, each running under unique user and group IDs. The network-facing subprocess could be chrooted to an empty directory, while the database subprocess resides within a chroot containing the SQLite file. Alternatively, more modern but OS-specific features can be used to limit each process.
But, to address the second part of the initial issue, do our subprocesses really need those privileges throughout their lifetimes? The answer is very often “no”, especially if the software is designed to perform privileged operations first. An architectural goal might be to start with the most privileged operations, then drop those privileges, and continue this cycle until the main task can be performed, which might also be the most dangerous, e.g., parsing user input.
The example web server may only require the permissions to listen on a port at the beginning. After that, the subprocess should be fine with the file descriptor it already has.
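A minimal sketch of that ordering, with dropPrivileges as a placeholder for whichever mechanism is actually used (chroot, setuid or the syscall filters shown later in this post):

package main

import (
    "fmt"
    "log"
    "net"
    "net/http"
)

// dropPrivileges is a placeholder: in a real program this would
// chroot, switch UID/GID or install a syscall filter.
func dropPrivileges() error { return nil }

func main() {
    // Privileged step first: open the listening socket.
    ln, err := net.Listen("tcp", ":8080")
    if err != nil {
        log.Fatalf("cannot listen: %v", err)
    }

    // Give up the privileges; the already open file descriptor
    // keeps working without the right to open new ones.
    if err := dropPrivileges(); err != nil {
        log.Fatalf("cannot drop privileges: %v", err)
    }

    log.Fatal(http.Serve(ln, http.HandlerFunc(
        func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprintln(w, "hello")
        })))
}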
Go Runtime
Nothing Go-specific has been stated so far. There are some elementary differences from C, such as Go having a runtime while C does not.
But there are low-level packages and functions in Go’s standard library that provide access to OS-specific features.
Most prominent is the frozen syscall package, which was replaced by golang.org/x/sys/unix for maintenance reasons and to break it free from the Go 1 compatibility promise.
So it is quite easy to port C-style privilege separation to Go, if one can adapt a bit.
Creating And Supervising Children
Even if Go has not quite reached C’s maturity, it has gone through over a decade of changes since Go version one.
Starting processes is one of the rare situations where one can actually see those changes.
While the syscall package had a ForkExec function, it did not make it into the golang.org/x/sys/unix package.
But wait, a quick interjection first.
In C, fork(2) and exec(3) are two independent steps.
Using fork(2) creates a new process, and functions from the exec(3) family - like execve(2) - replace the current process with another process or, in simpler terms, start another program from an executable file in the current process.
In Go’s syscall.ForkExec, these two low-level functions have been merged together to provide a higher-level interface.
This was most likely done to make it harder to break the Go runtime.
In addition to merging multiple functions into one, syscall.ForkExec also supports a wide range of specific attributes via syscall.SysProcAttr.
These attributes include user and group switching, chrooting and even cgroup support (v1, I’d guess).
Unfortunately, this code was frozen ten years ago and SysProcAttr lacks documentation.
Thus, I would advise taking a look at its implementation, but not using it.
One demotivating example might be the internal forkAndExecInChild1 function.
What to use instead?
The os package has a Process type and the os/exec package provides an even more abstract interface.
From now on, I will stick to os/exec and will do all privilege dropping by myself, even if os/exec still supports syscall.SysProcAttr.
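For completeness, this is roughly what the syscall.SysProcAttr route looks like when combined with os/exec. The binary path and the numeric IDs below are placeholders, and the remainder of this post drops privileges manually instead.

// Illustration only: the path and the IDs are placeholders.
cmd := exec.Command("/path/to/child")
cmd.SysProcAttr = &syscall.SysProcAttr{
    // Run the child under an unprivileged user and group.
    Credential: &syscall.Credential{Uid: 65534, Gid: 65534},
    // Lock the child into an (ideally empty) directory.
    Chroot: "/var/empty",
}
if err := cmd.Start(); err != nil {
    log.Fatalf("cannot start child: %v", err)
}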
For starters, a short demo that forks itself and selects the child operation mode via a command line argument flag should do the trick.
A bit of glue code can be written around os/exec, resulting in the forkChild function shown in the demo below.
package main
import (
"bufio"
"flag"
"fmt"
"log"
"os"
"os/exec"
"time"
)
// forkChild forks off a subprocess with -fork-child flag.
//
// The extraFiles are additional file descriptors for communication.
func forkChild(childName string, extraFiles []*os.File) (*os.Process, error) {
// pipe(2) to communicate child's output back to parent
logParent, logChild, err := os.Pipe()
if err != nil {
return nil, err
}
// For the moment, just print the child's output
go func() {
scanner := bufio.NewScanner(logParent)
for scanner.Scan() {
log.Printf("[%s] %s", childName, scanner.Text())
}
if err := scanner.Err(); err != nil {
log.Printf("Child output scanner failed: %v", err)
}
}()
cmd := &exec.Cmd{
Path: os.Args[0],
Args: append(os.Args, "-fork-child", childName),
Env: []string{}, // don't inherit parent's env
Stdin: nil,
Stdout: logChild,
Stderr: logChild,
ExtraFiles: extraFiles,
}
if err := cmd.Start(); err != nil {
return nil, err
}
return cmd.Process, nil
}
func main() {
var flagForkChild string
flag.StringVar(&flagForkChild, "fork-child", "", "")
flag.Parse()
switch flagForkChild {
case "":
// Parent code
childProc, err := forkChild("demo", nil)
if err != nil {
log.Fatalf("Cannot fork child: %v", err)
}
log.Printf("Started child process, wait for it to finish")
childProcState, _ := childProc.Wait()
log.Printf("Child exited: %d", childProcState.ExitCode())
case "demo":
// Child code
for i := range 3 {
fmt.Printf("hello world, %d\n", i)
time.Sleep(time.Second)
}
fmt.Println("bye")
default:
panic("This example has only one child")
}
}
While this example is quite trivial, it demonstrates how the parent process can .Wait() for children and even inspect the exit code.
Using this information, the parent can monitor its children and raise an alarm, restart children or crash the whole execution if a child exits prematurely.
In a more concrete example, where each child should run as long as the parent, the code waits for the first child to die or for a wild SIGINT to appear, and then cleans up all child processes.
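One possible sketch of such a supervision loop, assuming a childProcs slice holding the *os.Process values returned by forkChild (the variable is mine, not part of the demo above):

// childProcs is assumed to hold the *os.Process values
// returned by forkChild for all children.
sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, os.Interrupt)

exitCh := make(chan *os.ProcessState, len(childProcs))
for _, proc := range childProcs {
    go func(p *os.Process) {
        state, _ := p.Wait()
        exitCh <- state
    }(proc)
}

select {
case state := <-exitCh:
    log.Printf("A child exited prematurely: %v", state)
case sig := <-sigCh:
    log.Printf("Received %v, shutting down", sig)
}

// Clean up all remaining children before exiting.
for _, proc := range childProcs {
    _ = proc.Kill()
}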
Inter-Process Communication
This first example was nice and all, but a bit useless.
So far, no communication between the processes - main/parent and demo - is possible.
This can be solved by creating a bidirectional communication channel between two processes, e.g., via socketpair(2).
A socketpair(2) is similar to a pipe(2), but it is bidirectional (both ends can read and write) and supports certain features usually reserved for Unix domain sockets.
Using the already mentioned golang.org/x/sys/unix package allows creating a trivial helper function.
// socketpair is a helper function wrapped around socketpair(2).
func socketpair() (parent, child *os.File, err error) {
fds, err := unix.Socketpair(
unix.AF_UNIX,
unix.SOCK_STREAM|unix.SOCK_NONBLOCK,
0)
if err != nil {
return
}
parent = os.NewFile(uintptr(fds[0]), "")
child = os.NewFile(uintptr(fds[1]), "")
return
}
The previously introduced forkChild function came with an extraFiles parameter, effectively setting exec.Cmd{ExtraFiles: extraFiles}.
These extra files are then passed as file descriptors to the newly created process, following the standard streams stdin, stdout and stderr, which occupy file descriptors 0, 1 and 2 respectively.
Linking socketpair and forkChild’s extraFiles allows passing a bidirectional socket as file descriptor 3 to the child.
Let’s follow this idea and modify the demo part to implement a simple string-based API that supports both the hello and bye commands, returning a useful message back to the sender.
case "demo":
// Child code
cmdFd := os.NewFile(3, "")
cmdScanner := bufio.NewScanner(cmdFd)
for cmdScanner.Scan() {
switch cmd := cmdScanner.Text(); cmd {
case "hello":
_, _ = fmt.Fprintln(cmdFd, "hello again")
case "bye":
_, _ = fmt.Fprintln(cmdFd, "ciao")
return
}
}
This code starts by opening the third file descriptor as an os.File, using it both as a line-wise reader and as a writer for the output.
The counterpart can be altered accordingly.
case "":
// Parent code
childCommParent, childCommChild, err := socketpair()
if err != nil {
log.Fatalf("socketpair: %v", err)
}
childProc, err := forkChild("demo", []*os.File{childCommChild})
if err != nil {
log.Fatalf("Cannot fork child: %v", err)
}
log.Printf("Started child process, wait for it to finish")
cmdScanner := bufio.NewScanner(childCommParent)
for _, cmd := range []string{"hello", "hello", "bye"} {
_, _ = fmt.Fprintln(childCommParent, cmd)
log.Printf("Send %q command to child", cmd)
_ = cmdScanner.Scan()
log.Printf("Received from child: %q", cmdScanner.Text())
}
childProcState, _ := childProc.Wait()
log.Printf("Child exited: %d", childProcState.ExitCode())
In this example, the socketpair(2) is first created using our previously defined helper function.
The child part of the socketpair is then passed to the newly created child process, while the parent part is used for communication.
As an example, hello is called twice, followed by a bye call, expecting the child to finish afterwards.
Running this demo will look as follows. The RPC API works!
2025/02/12 22:05:45 Started child process, wait for it to finish
2025/02/12 22:05:45 Send "hello" command to child
2025/02/12 22:05:45 Received from child: "hello again"
2025/02/12 22:05:45 Send "hello" command to child
2025/02/12 22:05:45 Received from child: "hello again"
2025/02/12 22:05:45 Send "bye" command to child
2025/02/12 22:05:45 Received from child: "ciao"
2025/02/12 22:05:45 Child exited: 0
This RPC is quite simple, even for demonstration purposes. So it should be replaced by something more powerful that one would expect to find in real-world applications.
Dropping Privileges
Wait, before we get serious about RPCs, we should first introduce dropping privileges. Otherwise, doing everything that follows would be useless.
The motivation for this post started with a mental image of a program being split into several subprograms, each running only with the necessary privileges. The first part - breaking down a program - has already been addressed. Now it is time to drop privileges.
Luckily, this section will be rather short, since I feel that I have written more than enough on this topic in my earlier Dropping Privileges in Go post. I will assume that it has been read, or at least skimmed.
Looking at the demonstration program, there is a main thread that starts the child before communicating with it, and the child itself, which just handles some IO.
For this example, I am going to use my syscallset-go library to restrict system calls via Seccomp BPF.
While this only works on Linux, there are mechanisms for other operating systems, as mentioned in my previous post, e.g., pledge(2) on OpenBSD.
The main program first needs the privileges to create a socketpair(2) and launch the other program.
After that, it still communicates over the created file descriptor and monitors the other process.
So there are two places where privileges can be dropped: initially and after launching the process.
Please take a look at this altered main part, where the two highlighted syscallset.LimitTo blocks drop privileges.
case "":
// Parent code
if err := syscallset.LimitTo("@system-service"); err != nil {
log.Fatalf("seccomp-bpf: %v", err)
}
childCommParent, childCommChild, err := socketpair()
if err != nil {
log.Fatalf("socketpair: %v", err)
}
childProc, err := forkChild("demo", []*os.File{childCommChild})
if err != nil {
log.Fatalf("Cannot fork child: %v", err)
}
log.Printf("Started child process, wait for it to finish")
if err := syscallset.LimitTo("@basic-io @io-event @process"); err != nil {
log.Fatalf("seccomp-bpf: %v", err)
}
cmdScanner := bufio.NewScanner(childCommParent)
for _, cmd := range []string{"hello", "hello", "bye"} {
_, _ = fmt.Fprintln(childCommParent, cmd)
log.Printf("Send %q command to child", cmd)
_ = cmdScanner.Scan()
log.Printf("Received from child: %q", cmdScanner.Text())
}
childProcState, _ := childProc.Wait()
log.Printf("Child exited: %d", childProcState.ExitCode())
The same must be done for the demo program, where the only privileged task is opening file descriptor 3 for communication.
Afterwards, this process only needs to do IO for its simple RPC task.
case "demo":
// Child code
if err := syscallset.LimitTo("@system-service"); err != nil {
log.Fatalf("seccomp-bpf: %v", err)
}
cmdFd := os.NewFile(3, "")
if err := syscallset.LimitTo("@basic-io @io-event"); err != nil {
log.Fatalf("seccomp-bpf: %v", err)
}
cmdScanner := bufio.NewScanner(cmdFd)
for cmdScanner.Scan() {
switch cmd := cmdScanner.Text(); cmd {
case "hello":
_, _ = fmt.Fprintln(cmdFd, "hello again")
case "bye":
_, _ = fmt.Fprintln(cmdFd, "ciao")
return
}
}
Let’s take a moment to reflect on what was accomplished so far. Splitting up the process and applying different system call filters resulted in a first privilege-separated demo. This was actually a lot less code than one might expect.
Using A Real RPC
After reminding ourselves of how to drop privileges, we will move on to a larger example using this technique while also using a more mature RPC.
Since I find gRPC too powerful for this task, I will stick to Go’s net/rpc package, despite its shortcomings and feature-frozen state.
While the previous examples were very demo-like, the following one should be a bit more realistic. When finished, a child process should serve a simplified interface to a SQLite database, allowing only certain requests, while the main process should also drop privileges and serve the database’s content through a web server. To give it a realistic spin, let’s call this a web blog (or blog, as the cool kids say).
The skeleton with the forkChild function and the -fork-child command line argument based main function remains.
However, the database needs some code, especially some that can be used by net/rpc.
The following should work, creating a Database type and two RPC methods, ListPosts and GetPost.
// Database is a wrapper type around *sql.DB.
type Database struct {
db *sql.DB
}
// OpenDatabase opens or creates a new SQLite database at the given file.
//
// If the database should be created, it will be populated with the posts table
// and two example entries.
func OpenDatabase(file string) (*Database, error) {
_, fileInfoErr := os.Stat(file)
requiresSetup := errors.Is(fileInfoErr, os.ErrNotExist)
db, err := sql.Open("sqlite3", file)
if err != nil {
return nil, err
}
if requiresSetup {
if _, err := db.Exec(`
CREATE TABLE posts (id INTEGER NOT NULL PRIMARY KEY, text TEXT);
INSERT INTO posts(id, text) VALUES (0, 'hello world!');
INSERT INTO posts(id, text) VALUES (1, 'second post, wow');
`); err != nil {
return nil, fmt.Errorf("cannot prepare database: %w", err)
}
}
return &Database{db: db}, nil
}
// ListPosts returns all post ids as an array of integers.
//
// This method follows the net/rpc method specification.
func (db *Database) ListPosts(_ *int, ids *[]int) error {
rows, err := db.db.Query("SELECT id FROM posts")
if err != nil {
return err
}
*ids = make([]int, 0, 128)
for rows.Next() {
var id int
if err := rows.Scan(&id); err != nil {
return err
}
*ids = append(*ids, id)
}
return rows.Err()
}
// GetPost returns a post's text for the id.
//
// This method follows the net/rpc method specification.
func (db *Database) GetPost(id *int, text *string) error {
return db.db.QueryRow("SELECT text FROM posts WHERE id = ?", &id).Scan(text)
}
Without further ado, create a database main entry using this Database type.
In this case, the child will not be named demo, since it now serves a real purpose.
case "database":
// SQLite database child for posts
if err := syscallset.LimitTo("@system-service"); err != nil {
log.Fatalf("seccomp-bpf: %v", err)
}
rpcFd := os.NewFile(3, "")
db, err := OpenDatabase("posts.sqlite")
if err != nil {
log.Fatalf("Cannot open SQLite database: %v", err)
}
if err := landlock.V5.BestEffort().RestrictPaths(
landlock.RODirs("/proc"),
landlock.RWFiles("posts.sqlite"),
); err != nil {
log.Fatalf("landlock: %v", err)
}
if err := syscallset.LimitTo("@basic-io @io-event @file-system"); err != nil {
log.Fatalf("seccomp-bpf: %v", err)
}
rpcServer := rpc.NewServer()
rpcServer.Register(db)
rpcServer.ServeConn(rpcFd)
This code starts by dropping some privileges to prohibit the juicy syscalls.
Then it opens the already known file descriptor 3 next to the SQLite database located at posts.sqlite.
Since no further files need to be accessed at this point, the privileges are dropped again.
It starts with Landlock LSM, allowing only read-only access to Linux’ /proc (required by some Go internals) and read-write access to the posts.sqlite database file.
Next comes a stricter system call filter.
Finally, the net/rpc server is started on the third file descriptor, serving the Database.
This will block until the connection is closed, which effectively means the child has finished.
The main part now needs to follow.
As initially outlined, it should start by forking off the database child, then drop its own privileges, and finally serve a web server.
Since there are two RPC methods, Database.ListPosts and Database.GetPost, they can be queried from the main code and used to build the web frontend for this blog.
case "":
// Parent: Starts children, drops to HTTP server
if err := syscallset.LimitTo("@system-service"); err != nil {
log.Fatalf("seccomp-bpf: %v", err)
}
databaseCommParent, databaseCommChild, err := socketpair()
if err != nil {
log.Fatalf("socketpair: %v", err)
}
_, err = forkChild("database", []*os.File{databaseCommChild})
if err != nil {
log.Fatalf("Cannot fork database child: %v", err)
}
httpLn, err := net.Listen("tcp", ":8080")
if err != nil {
log.Fatalf("cannot listen: %v", err)
}
if err := landlock.V5.BestEffort().RestrictPaths(
landlock.RODirs("/proc"),
); err != nil {
log.Fatalf("landlock: %v", err)
}
if err := syscallset.LimitTo("@basic-io @io-event @network-io @file-system"); err != nil {
log.Fatalf("seccomp-bpf: %v", err)
}
rpcClient := rpc.NewClient(databaseCommParent)
httpMux := http.NewServeMux()
httpMux.HandleFunc("GET /", func(w http.ResponseWriter, r *http.Request) {
w.Header().Add("Content-Type", "text/html")
var ids []int
err = rpcClient.Call("Database.ListPosts", 0, &ids)
if err != nil {
http.Error(w, "cannot list posts: "+err.Error(), http.StatusInternalServerError)
return
}
_, _ = fmt.Fprint(w, `<ul>`)
for _, id := range ids {
_, _ = fmt.Fprintf(w, `<li><a href="/post/%d">Post %d</a></li>`, id, id)
}
_, _ = fmt.Fprint(w, `</ul>`)
})
httpMux.HandleFunc("GET /post/{id}", func(w http.ResponseWriter, r *http.Request) {
w.Header().Add("Content-Type", "text/html")
id, err := strconv.Atoi(r.PathValue("id"))
if err != nil {
http.Error(w, "cannot parse ID: "+err.Error(), http.StatusInternalServerError)
return
}
var text string
err = rpcClient.Call("Database.GetPost", &id, &text)
if err != nil {
http.Error(w, "cannot fetch post: "+err.Error(), http.StatusInternalServerError)
return
}
_, _ = fmt.Fprintf(w, `<h1>Post %d</h1><p>%s</p>`, id, html.EscapeString(text))
})
httpd := http.Server{Handler: httpMux}
log.Fatal(httpd.Serve(httpLn))
Like the child, the main part starts by dropping system calls to the reasonable @system-service group.
Then it creates the socketpair(2) and launches the child, as done in the previous examples.
Another privileged operation follows, opening a TCP port to listen on :8080.
At this point, privileges can be dropped further, again using Landlock LSM and Seccomp BPF.
Next, an RPC client connection is established against the parent’s side of the socketpair(2) and the web server’s endpoints are defined.
All known posts should be listed on /, which can be requested from the RPC client via the Database.ListPosts method.
More details will then be available on /post/{id} using the Database.GetPost RPC method.
In the end, an http.Server is created and served on the previously created TCP listener.
Incoming requests are served and result in an RPC call to the child process that is allowed to access the database.
Passing Around File Descriptors
While this RPC works well for these constraints, what about file access? Think about an RPC API that needs to access lots of files and pass the contents from one process to another. Reading the whole file, encoding it, sending it, receiving it and decoding it does not sound very efficient. Fortunately, there is a way to actually share file descriptors between processes on POSIX systems.
One process can open a file, pass the file descriptor to the other process, and close the file again, while the other process can now read the file, even though it might otherwise have no access to it. The art of passing file descriptors is a bit more obscure and beautifully explained in chapter 17.4 Passing File Descriptors of the definitive book Advanced Programming in the Unix Environment. If you are interested in the details, please check it out - PDFs are available online.
The following works because our socketpair(2) call created a pair of AF_UNIX sockets, effectively Unix domain sockets.
In addition to exchanging streaming data over a Unix domain socket, it is also possible to pass specific control messages.
But let’s start with the code, which may look a bit cryptic on its own.
// unixConnFromFile converts a file (FD) into an Unix domain socket.
func unixConnFromFile(f *os.File) (*net.UnixConn, error) {
fConn, err := net.FileConn(f)
if err != nil {
return nil, err
}
conn, ok := fConn.(*net.UnixConn)
if !ok {
return nil, fmt.Errorf("cannot use (%T, %T) as *net.UnixConn", f, conn)
}
return conn, nil
}
// sendFd sends an open File (its FD) over an Unix domain socket.
func sendFd(f *os.File, conn *net.UnixConn) error {
oob := unix.UnixRights(int(f.Fd()))
_, _, err := conn.WriteMsgUnix(nil, oob, nil)
return err
}
// recvFd receives a File (its FD) from an Unix domain socket.
func recvFd(conn *net.UnixConn) (*os.File, error) {
oob := make([]byte, 128)
_, oobn, _, _, err := conn.ReadMsgUnix(nil, oob)
if err != nil {
return nil, err
}
cmsgs, err := unix.ParseSocketControlMessage(oob[0:oobn])
if err != nil {
return nil, err
} else if len(cmsgs) != 1 {
return nil, fmt.Errorf("ParseSocketControlMessage: wrong length %d", len(cmsgs))
}
fds, err := unix.ParseUnixRights(&cmsgs[0])
if err != nil {
return nil, err
} else if len(fds) != 1 {
return nil, fmt.Errorf("ParseUnixRights: wrong length %d", len(fds))
}
return os.NewFile(uintptr(fds[0]), ""), nil
}
We start with the unixConnFromFile function, which creates a *net.UnixConn based on a generic *os.File.
This allows converting one end of the socketpair(2) to a Unix domain socket without losing Go’s type safety.
Then, the sendFd function encodes the file descriptor to be sent into a socket control message and sends it over the virtual wire.
On the other side, the recvFd function waits for such a control message, unpacks it and returns a new *os.File to be used.
To give a little background: each process has its own file descriptor table, whose entries point into the kernel’s file table, which is eventually mapped to a vnode entry. Thus, one process’ file descriptor 42 and another’s file descriptor 23 could actually refer to the same file. The same applies here: sending a file descriptor will most likely result in a different file descriptor number at the receiving end. However, the kernel will take care that this little stunt works.
Again, please consult Stevens’ Advanced Programming in the Unix Environment for more details or take a look at the implementation in the golang.org/x/sys/unix package.
Or just accept that it works and move on.
Let’s extend the previous example and add another child process that serves pictures for each blog post to be shown. This child will need file system access to a directory of images, sending them over to the main process via file descriptor passing, as just introduced.
First, implement the new child.
case "img":
// File storage child to send pictures for posts back as a file descriptor
if err := syscallset.LimitTo("@system-service"); err != nil {
log.Fatalf("seccomp-bpf: %v", err)
}
rpcFd := os.NewFile(3, "")
rpcSock, err := unixConnFromFile(rpcFd)
if err != nil {
log.Fatalf("cannot create Unix Domain Socket: %v", err)
}
imgDir, err := filepath.Abs("./cmd/07-05-fork-exec-rpc/imgs/")
if err != nil {
log.Fatalf("cannot abs: %v", err)
}
if err := landlock.V5.BestEffort().RestrictPaths(
landlock.RODirs("/proc", imgDir),
); err != nil {
log.Fatalf("landlock: %v", err)
}
if err := syscallset.LimitTo("@basic-io @io-event @file-system @network-io"); err != nil {
log.Fatalf("seccomp-bpf: %v", err)
}
rpcScanner := bufio.NewScanner(rpcFd)
for rpcScanner.Scan() {
file, err := filepath.Abs(imgDir + "/" + rpcScanner.Text() + ".png")
if err != nil {
log.Printf("cannot abs: %v", err)
continue
}
if dir := filepath.Dir(file); dir != imgDir {
log.Printf("file directory %q mismatches, expected %q", dir, imgDir)
continue
}
f, err := os.Open(file)
if err != nil {
log.Printf("cannot open: %v", err)
continue
}
if err := sendFd(f, rpcSock); err != nil {
log.Printf("cannot send file descriptor: %v", err)
}
_ = f.Close()
}
I hope you are not bored reading this kind of code.
It gets pretty repetitive, I know.
But please bear with me and follow me through the img child.
The first part should be quite familiar by now: forbidding some syscalls and opening file descriptor 3. But now the third file descriptor is also converted to a Unix domain socket for later use. Landlock LSM restricts directory access to the directory containing the pictures, and a stricter Seccomp BPF filter follows.
After that, a simple string-based RPC is used again, reading which files to open line by line. Besides a simple prefix check, the Landlock LSM filter denies everything outside the allowed directory. If the file can be opened, its file descriptor will be sent back to the main process.
A few small changes are required in the main process. They are highlighted and explained below.
case "":
// Parent: Starts children, drops to HTTP server
if err := syscallset.LimitTo("@system-service"); err != nil {
log.Fatalf("seccomp-bpf: %v", err)
}
databaseCommParent, databaseCommChild, err := socketpair()
if err != nil {
log.Fatalf("socketpair: %v", err)
}
imgCommParent, imgCommChild, err := socketpair()
if err != nil {
log.Fatalf("socketpair: %v", err)
}
imgCommSock, err := unixConnFromFile(imgCommParent)
if err != nil {
log.Fatalf("cannot create Unix Domain Socket: %v", err)
}
_, err = forkChild("database", []*os.File{databaseCommChild})
if err != nil {
log.Fatalf("Cannot fork database child: %v", err)
}
_, err = forkChild("img", []*os.File{imgCommChild})
if err != nil {
log.Fatalf("Cannot fork img child: %v", err)
}
httpLn, err := net.Listen("tcp", ":8080")
if err != nil {
log.Fatalf("cannot listen: %v", err)
}
if err := landlock.V5.BestEffort().RestrictPaths(
landlock.RODirs("/proc"),
); err != nil {
log.Fatalf("landlock: %v", err)
}
if err := syscallset.LimitTo("@basic-io @io-event @network-io @file-system"); err != nil {
log.Fatalf("seccomp-bpf: %v", err)
}
rpcClient := rpc.NewClient(databaseCommParent)
httpMux := http.NewServeMux()
httpMux.HandleFunc("GET /", func(w http.ResponseWriter, r *http.Request) {
// Same as before.
})
httpMux.HandleFunc("GET /post/{id}", func(w http.ResponseWriter, r *http.Request) {
w.Header().Add("Content-Type", "text/html")
id, err := strconv.Atoi(r.PathValue("id"))
if err != nil {
http.Error(w, "cannot parse ID: "+err.Error(), http.StatusInternalServerError)
return
}
var text string
err = rpcClient.Call("Database.GetPost", &id, &text)
if err != nil {
http.Error(w, "cannot fetch post: "+err.Error(), http.StatusInternalServerError)
return
}
_, _ = fmt.Fprintln(imgCommParent, id)
imgFd, err := recvFd(imgCommSock)
if err != nil {
http.Error(w, "cannot fetch img: "+err.Error(), http.StatusInternalServerError)
}
defer imgFd.Close()
_, _ = fmt.Fprintf(w, `<h1>Post %d</h1><p>%s</p>`, id, html.EscapeString(text))
_, _ = fmt.Fprint(w, `<img src="data:image/png;base64,`)
encoder := base64.NewEncoder(base64.StdEncoding, w)
io.Copy(encoder, imgFd)
encoder.Close()
_, _ = fmt.Fprint(w, `" />`)
})
httpd := http.Server{Handler: httpMux}
log.Fatal(httpd.Serve(httpLn))
The first changes are to create another socketpair(2) and fork off the second child.
Except for also creating a unixConnFromFile, they are analogous to the startup code for the first child process.
The interesting part happens inside the HTTP handler for /post/{id}.
If the SQLite database knows of a post for the id, that id is written to the parent’s socketpair(2) end to be read by the img child’s RPC loop.
The code then waits to receive a file descriptor over the Unix domain socket created on the same connection.
After receiving the file descriptor, its content is copied into a base64 encoder and written as an encoded image back to the web response.
This example now has two subprocesses, each running with differently restricted privileges. An RPC mechanism in between allows inter-process communication, including the passing of file descriptors. At this point, it is safe to say that privilege separation has been achieved.
What’s Next?
This post was the logical successor to Dropping Privileges in Go. While the first one described how to drop various privileges, this one focused on architectural changes to drop privileges more granularly. Adding privilege separation to the toolbox of software architectures makes it possible to build more robust software under the assumption that the software will be pwned some day.
The examples shown here and more are available in a public git repository at codeberg.org/oxzi/go-privsep-showcase. Before creating these explicit examples, I toyed with these technologies in a “research project” of mine, called gosh. Please feel free to take a look at it for more inspiration.
There are still some related topics I plan to write about, but I will not go so far as to make any announcements. Stay tuned.
Sunday, 16 February 2025
The weekend after I ♥ Free Software Day 2025 – Sunday
This is part II of the I Love Free Software Day blog post. More specifically, it is about the game Veloren, which I played once three years ago, when the pandemic was still ongoing. The computer that I had at the time did not have a good GPU, so I used my brother’s old computer with an NVIDIA card. A few years ago I got into VR, which is only possible in freedom thanks to the libsurvive project. To be able to play VR games I bought an AMD graphics card; before that I used my Talos II’s built-in ASpeed graphics. With the new GPU I can drive up to 4 monitors. All the games that I run on my Talos II are free software:
VRChat is popular in the Transgender community, but I avoid it since it is non-free. V-Sekai is one of the free software replacements. While we need binaries to run a program on a computer, we also need source code for a program to qualify as free software. I am using part of the V-Sekai code in my BeatSaber clone called BeepSaber, as I want full body tracking controlling an animated VRM avatar. While I usually present masculine as an enby (I never wear a beard), my VRM avatar will be female. For me that is a way to try out genders different from my gender assigned at birth.
There is another free software VR game that I would like to play. It is called VoxelWorksQuest and its author is the same person that wrote BeepSaber. Unfortunately it is unmaintained, so I decided to replace it with a VR port of Minetest (now called Luanti). Minetest is similar to Minecraft and Veloren, but it is both free software and able to run on old computers with built-in freedom-respecting GPUs. Luanti is also written in a programming language called Lua, which I use a lot at work. In the next few weeks I will continue to work on Minetest XR, adding missing important features to make the game playable.
Next month I will go to the Chemnitzer Linux-Tage, where I also expect to meet many queers. There is also a “Gaming Night” and many interesting talks, including one about KiCad (the tool that I use for building my own hardware), BTRFS (my preferred filesystem), banking apps (unfortunately not GNU Taler) and Passkeys (allowing passwordless login). In the meantime I will watch recorded videos from FOSDEM, starting with “Declarative and Minimalistic Computing” and then moving on to “Open Hardware and CAD/CAM”.
Solana signature count limit
Implementing the Solana IBC bridge, I had to deal with various constraints of the Solana protocol. Connecting Solana to Composable Foundation’s Picasso network, I needed to develop an on-chain light client capable of validating Tendermint blocks. This meant being able to validate about 50 signatures in a single transaction.
Turns out that’s not possible on Solana and it’s not exactly because of the execution time limit. The real culprit is the transaction size limit which I’ve discussed previously. This article describes how signature verification is done on Solana, the limit on the number of signatures that can be verified in a single transaction and how that limit can be worked around.
Like before, this article assumes familiarity with Solana and doesn’t serve as a step-by-step guide for Solana development. Nonetheless, examples from the solana-sigverify repository can be used as a starting point when using the signature verification mechanism described below.
Cryptographic functions on Solana
Solana programs can perform any computation a regular computer can do. It is possible to implement cryptographic functions as part of a smart contract and have them executed on the blockchain. However, that’s a quick way to run into issues.
Calculating a 256-bit SHA2 digest of a 100-byte buffer takes 14 thousand compute units (CU). Meanwhile, Solana programs have a hard limit of 1.4 million CU. In other words, hashing 100 bytes takes up 1% of the absolute maximum computation that a smart contract can perform in a single transaction. The situation is even worse with more advanced cryptographic functions: a signature verification blows through the compute limit.
Thankfully, Solana offers native methods for popular cryptographic primitives. In particular, a sol_sha256 system call (accessible through the solana_program::hash::hashv function) computes a 256-bit SHA2 hash at a cost of only 267 CU for a 100-byte buffer. One might expect a similar sol_ed25519_verify system call; however, signature verification is done in a much more convoluted way on Solana.
Signature verification
Solana includes a handful of native programs. Among those are programs, such as the Ed25519 program, which perform signature verification. To check a signature, the caller creates a transaction including two instructions: one calling the native signature verification program and another calling a smart contract. If a signature is invalid, the whole transaction fails and the smart contract is not called.1
For example, consider transaction 56QjWeDDDX4Re2sX… which has three instructions: the first adjusts the compute unit limit, the second invokes the Ed25519 program and the final one calls a program which can check that the signature has been verified. The annotated instruction data of the Ed25519 program invocation is shown below:
Offset | Bytes | Notes
--- | --- | ---
0x00 | 01 00 80 00 ff ff c0 00 ff ff 10 00 70 00 ff ff | Request header
0x10 | 6f 08 02 11 e5 61 6a 00 00 00 00 00 22 48 0a 20 | Signed message (0x70 bytes)
 | a0 c2 78 ea ac 5e ba ce cf f5 6b 0a 33 2b 12 60 |
 | 78 8a e9 2c 3e d9 17 14 c0 fe c3 71 ca 79 57 a7 |
 | 12 24 08 01 12 20 61 43 1a 05 af 4d 46 64 6f 71 |
 | 0b 59 f7 c3 c1 6f ca c6 10 d2 05 63 77 97 d0 4d |
 | ad 15 ed 32 ee b7 2a 0c 08 82 f9 fd b6 06 10 b2 |
 | b7 a5 95 01 32 0a 63 65 6e 74 61 75 72 69 2d 31 |
0x80 | 1a fd b3 c7 85 6c 16 82 2a 59 f6 3e d8 d3 fd 7a | Signature
 | 7b ab bd 8b 77 c1 0a 90 2c 38 8c 06 69 88 62 cd |
 | 22 b2 4f 7e b5 cf 13 7c 97 00 d2 4d e3 da 08 1d |
 | f6 ad 3f 05 33 6e 35 47 15 5d 59 b8 fe e9 e6 07 |
0xc0 | c2 aa 20 50 7f 78 d5 49 f6 85 50 9d d0 8b 64 89 | Public key
 | 80 60 5a d2 ad 3e 90 b3 e8 0b 5d 24 b2 14 22 7b |
Signature count limit
Since the public key, signature and signed message are stored in the instruction data, the number of signatures that can be verified is subject to the 1232-byte transaction size limit. Accounting for overhead leaves less than 1100 bytes for the instruction data. To specify a signature for verification, a 14-byte header, a 32-byte public key, a 64-byte signature and the message are needed. Even assuming an empty message, that’s 110 bytes per signature, which means at most ten signatures in a single transaction.2
For Solana IBC I needed to verify Tendermint block signatures. Tendermint validators timestamp their signatures and as a result each signature is for a different message of about 112 bytes. That gives a maximum of ⌊1100 / (14 + 32 + 64 + 112)⌋ = 4 signatures per transaction. Meanwhile, as mentioned at the start, I needed to verify about 50 of them.
The sigverify program
To address this limitation, signatures can be verified in batches with results aggregated into a signatures account that can be inspected later on.3 This scheme needs i) a smart contract capable of doing the aggregation and ii) an interface for other smart contracts to interpret the aggregated data.
The first point is addressed by the sigverify program. Using the regular Solana way of signature verification, it observes which signatures have been checked in the transaction. It then aggregates all that information into a Program Derived Address (PDA) account it owns. As the owner, only the sigverify program can modify the account, thus making sure that the aggregated information stored in it is correct.
Even though the sigverify program owns the account, it internally assigns it to the signer such that users cannot interfere with each other’s signatures accounts.
RPC Client
A convenient way to call the sigverify program is to use the client library functions packaged alongside it. To use them, first add a new dependency to the RPC client (making sure the client feature is enabled):
[dependencies.solana-sigverify]
git = "https://github.com/mina86/solana-sigverify"
features = ["client"]
The crate has an UpdateIter iterator which generates instruction pairs that need to be executed to aggregate the signatures into the signatures account. The account can be reused, but in that case each batch of signatures needs to use a different epoch. If the account is freed each time, the epoch can be set to None. Note that all the instruction pairs returned by UpdateIter can be executed in parallel transactions.
// Generate list of signatures to verify.
let entries: Vec<Entry> = signatures.iter()
    .map(|sig| Entry {
        pubkey: &sig.pubkey,
        signature: &sig.signature,
        message: &sig.message,
    })
    .collect();

// When signatures account is reused, each use needs
// a different epoch value.  Otherwise it can be None.
let epoch = std::time::SystemTime::now()
    .duration_since(std::time::UNIX_EPOCH)
    .unwrap()
    .as_nanos() as u64;
let epoch = Some(epoch);

// Generate all necessary instructions and send them to
// Solana.  UpdateIter splits signatures into groups as
// necessary to call sigverify.
let (iter, signatures_account, signatures_bump) =
    solana_sigverify::instruction::UpdateIter::new(
        &solana_sigverify::algo::Ed25519::ID,
        SIGVERIFY_PROGRAM_ID,
        signer.pubkey(),
        SIGNATURES_ACCOUNT_SEED,
        epoch,
        &entries,
    )?;

// To speed things up, all of those instruction pairs can
// be executed in parallel.
for insts in iter {
    let blockhash = client.get_latest_blockhash()?;
    let message = Message::new_with_blockhash(
        &insts,
        Some(&signer.pubkey()),
        &blockhash,
    );
    send_and_confirm_message(
        client, signer, blockhash, message)?;
}

// Invoke the target smart contract.  signatures_account is
// the account with aggregated signatures.  It will need to
// be passed to the smart contract.
todo!("Invoke the target smart contract");

// Optionally free the account.  Depending on usage, the
// account can be reused (to save minor amount of gas fees)
// with a new epoch as described.
let instruction = solana_sigverify::instruction::free(
    SIGVERIFY_PROGRAM_ID,
    signer.pubkey(),
    Some(signatures_account),
    SIGNATURES_ACCOUNT_SEED,
    signatures_bump,
)?;
send_and_confirm_instruction(client, signer, instruction)?;
The signatures are aggregated into an account whose address is stored in signatures_account. The SIGNATURES_ACCOUNT_SEED allows a single signer to maintain multiple accounts if necessary.
The target smart contract
The target smart contract needs to be altered to support reading the aggregated signatures. Code which helps with that is available in the solana-sigverify crate as well. First, add a new dependency to the program (making sure the lib feature is enabled):
[dependencies.solana-sigverify]
git = "https://github.com/mina86/solana-sigverify"
features = ["lib"]
The crate has a Verifier family of types which can interface with Solana’s native signature verification programs (like the Ed25519 program) and with the aggregated signatures. This flexibility allows smart contracts using this type to nearly transparently support the normal Solana signature verification method or the signature aggregation through the sigverify program.
/// Address of the sigverify program.  This must be set
/// correctly or the signature verification won’t work.
const SIGVERIFY_PROGRAM_ID: Pubkey =
    solana_program::pubkey!("/* … */");

let mut verifier = solana_sigverify::Ed25519Verifier::default();

// To check signatures from a call to a native signature
// verification program, the verifier must be initialised
// with the Instructions sysvar.  The native program call
// must immediately precede the current instruction.
let instructions_sysvar = /* … */;
verifier.set_ix_sysvar(instructions_sysvar)?;

// To check signatures aggregated in a signatures account,
// the verifier must be initialised with the account.  For
// security, the expected sigverify program ID must be
// specified as well; the verifier rejects signatures
// accounts not owned by the sigverify program.
let signatures_account = /* … */;
verifier.set_sigverify_account(
    signatures_account, &SIGVERIFY_PROGRAM_ID)?;

// To verify a signature, call the verify method.
if !verifier.verify(message, pubkey, signature)? {
    panic!("Signature verification failed");
}
Conclusion
It’s perhaps counterintuitive that the transaction size limit constrains how many signatures can be verified in a single Solana transaction, but because of the design of the native signature verification programs, it is indeed the case. Thankfully, with some engineering the restriction can be worked around.
This article introduces a method of aggregating signatures across multiple transactions with the help of the sigverify program and the library functions available in the solana-sigverify repository. The library code supports the regular Solana signature verification method as well as the aggregated signatures, providing flexibility to the smart contract.
Lastly, the repository also provides the solana-native-sigverify crate which offers APIs for interacting with the Solana native signature verification programs.
Saturday, 15 February 2025
The weekend after I ♥ Free Software Day 2025 – Saturday
Yesterday I contributed to the Free Software Directory; unfortunately, I did not have enough time to write my intended blog post. So I am doing that now. #ilovefs
Free Software is a matter of freedom, not price. A program is free software if the program’s users have the four essential freedoms:
The freedom to run the program as you wish, for any purpose (freedom 0).
The freedom to study how the program works and to make changes (freedom 1).
The freedom to redistribute copies so you can help others (freedom 2).
The freedom to distribute copies of your modified versions to others (freedom 3).
I have been developing free software both professionally and as a hobby for a long time. Recently I came out as genderqueer (also known as non-binary) at the first meeting of a GNU/Linux user group I cofounded. The weekend before, I was at FOSDEM, where I met many trans and non-binary individuals, some of them at the Guix Days fringe event.
I also have used software written by trans and non-binary contributors. This includes the FPGA toolchain based on yosys and nextpnr, GNU MediaGoblin, SlimeVR and Monado, to just name a few. For an upcoming hardware project (a Lighthouse-tracked VR headset with RYF in mind), I will be using an iCE40 FPGA. For GNU MediaGoblin, I plan to set up my own instance again. I specifically backed SlimeVR because the software is portable and can be used on a freedom-respecting Talos II. When I saw the BLÅHAJ in the SlimeVR video I immediately recognized it as part of the Transgender culture. I also saw more than one BLÅHAJ at FOSDEM. Finally there is Monado, where one of the lead developers and some other contributors identify as non-binary and/or transgender. Many of those persons were not out when I started using the software. Over time, I realized that I am trans too, more specifically non-binary.
Part II of the blog post will be done tomorrow.
Friday, 14 February 2025
Thank you for the editor of the beast
On today's "I love Free Software Day" I would like to thank again Bram Moolenaar, creator of the widely used Vim text editor and all the people active in the community around VIM.
While many people in the movement for software freedom know about VI, VIM, NeoVim, or other flavours, for many people outside our movement this software is installed on their computers, often without them actively installing it or noticing it as a separate component. There are no exact numbers, but with all the installations on GNU/Linux (servers, virtual machines, desktops, ...), BSD and Unix systems, macOS, Microsoft Windows and the Windows Subsystem for Linux, I feel comfortable claiming there are way more than 1 billion installations.
Matthias and Raphael at SFSCON 2024 asking all people in the audience to stand up if they are using Vim, or have used Vim in the past. Matthias showing the VI M sign with his fingers - CC-BY-SA 4.0 NOI
While it is a "hidden software" for many out there, it is one of the most important tools for other; including myself. IIRC I started using the text editor Vim in 1999 when I installed my first GNU/Linux distribution. In an e-mail from July 2001, a friend complains that he had some encoding issues with a file I sent him, where I told him that I wrote it with Vim, in November 2001 I sent parts of my Vim config file to another person from our Free Software user group, and my first public post seems to be in 2002 when I engaged in discussions about Vim on the German Debian User mailing list (yes, I would today engage differently in such discussions).
For 25 years, a quarter of a century, I have almost daily written notes and e-mails using the Vim key bindings; all the papers at university (through LaTeX) and almost everything that has been published from me had its origin in a text file I edited the "Vim way". The commands are meanwhile part of my muscle memory.
In the preparation for the European SFS Award 2024 I had the honour of talking to many people who closely worked with Bram and with his family. To express my gratitude, let me quote the laudatio again, which was a quite emotional moment for me (if you prefer, you can also watch the video recording).
Matthias: It is an honor to present the European SFS Award 2024. The FSFE and LUGBZ worked together again this year to find a winner from all nominations. This year’s European SFS Award goes to someone whose work transformed how many interact with computers, creating a tool for Free Software contributors, developers, and creators. A tool that new users might be a little afraid of because it can be tricky to exit.
Raphael: (Yes, you may know the software we’re talking about.) A piece of code that makes every keystroke feel like a power move, where “Esc” is the most important key on your keyboard. Since its launch in 1991, this software has spread across more than 15 operating systems and is installed on millions of computers around the world.
Matthias: For our winner, efficiency of computer users was crucial. His mantra was: “Detect inefficiencies, find a quicker way, make it a habit!” and he helped many people learn how to actually accomplish this. He went on to help those he met on mailing lists, at conferences like SFSCON in 2009, or at his workplace. He even talked to public administrations, so they actually use and thereby benefit from Free Software. He wanted to ensure that all software which is procured by public administrations is published under a Free Software license for the good of society.
Raphael: Educating others to empower them was also important for him outside of the technology field. He helped children in Uganda -- who often lost their parents due to HIV -- to get an education at the Kibaale Community Centre. He enabled school education for many of them so they can take care of themselves and their families in the long run. He founded an NGO to collect donations for this work; even on his work desk there was a piggy bank so that visitors could easily donate.
Matthias: There was a huge online rivalry between the users of his software and those on the other side: those who used another "operating system" and who called his software the "editor of the beast". This rivalry became an enduring part of hacker culture and the Free Software community. A huge fan of Monty Pythons, this year's winner did not shy away from engaging in such banter.
Raphael: His dedication was enormous. His family will not forget the moments in which he disappeared on Christmas Day because he "needed to fix some bugs". It gave him great pleasure to develop and use his software, and he wanted to help others to also experience this joy. "If you are happy, I am happy!" was one of his sayings. He took every opportunity to work on his projects, even while in the hospital.
Matthias: With his death on 3 August 2023 the Free Software community lost a person who enabled thousands of people to contribute efficiently to software freedom. We regret that he was not able to live longer with his beloved turtles, finishing his plans for a vacuum robot that could clean stairways, fixing bugs, implementing new features for the users of his software, and being here with us.
Raphael: For his remarkable contributions to software freedom the European SFS Award 2024 goes posthumously to Bram Moolenaar, the creator of Vi IMproved -- or VIM.
Matthias: So, please join us in a big round of applause for Bram Moolenaar.
On today's "I love Free Software" day, let me thank again Bram Moolenaar for all the mentioned work. Thank you, Christian Brabandt and the other contributors who took the coordination of Vim after Bram passed away. Thank you to Sven Guckes (I wrote about his death in 2022) who helped me and others with many Vim questions plus showed me how to do the "VI M" with my fingers like in the picture above, and thank you to all the people from projects like NeoVIM, Nvi, Busybox Vi, who develop and maintain their Vi flavour.
Matthias thanking Bram Moolenaar on stage at SFSCON 2024 with picture of Bram coding on Vim in the background - CC-BY-SA 4.0 NOI
Monday, 10 February 2025
KDE Gear 25.04 release schedule
This is the release schedule the release team agreed on
https://community.kde.org/Schedules/KDE_Gear_25.04_Schedule
Dependency freeze is in around 3 weeks (March 6) and feature freeze one week
after that. Get your stuff ready!
Sunday, 09 February 2025
Solana transaction size limit
Solana transactions are limited to 1232 bytes, which was too restrictive when I was implementing the Solana IBC bridge while working at Composable Foundation. The smart contract had to be able to ingest signed Tendermint block headers which were a few kilobytes in size.
To overcome this obstacle, I’ve used what I came to call chunking. By sending the instruction data in multiple transactions (similarly to the way Solana programs are deployed), the Solana IBC smart contract is capable of working on arbitrarily-large instructions. This article describes how this process works and how to incorporate it with other smart contracts (including those using the Anchor framework).
This article assumes familiarity with Solana; it’s not a step-by-step guide on creating Solana programs. Nevertheless, code samples are from examples available in the solana-write-account
repository (a chsum-program
Solana program and chsum-client
command line tool used to invoke said program) which can be used as a starting point when incorporating the chunking method described below.
Demonstration of the problem
First let’s reproduce the described issue. Consider the following (somewhat contrived) chsum
Solana program which computes a simple parameterised checksum. When called, the first byte of its instruction data is the checksum parameter and the rest is the data to calculate the checksum of.
use solana_program::account_info::AccountInfo;
use solana_program::program_error::ProgramError;
use solana_program::pubkey::Pubkey;

solana_program::entrypoint!(process_instruction);

fn process_instruction<'a>(
    _program_id: &'a Pubkey,
    _accounts: &'a [AccountInfo],
    instruction: &'a [u8],
) -> Result<(), ProgramError> {
    let (mult, data) = instruction
        .split_first()
        .ok_or(ProgramError::InvalidInstructionData)?;
    let sum = data.chunks(2).map(|pair| {
        u64::from(pair[0]) * u64::from(*mult)
            + pair.get(1).copied().map_or(0, u64::from)
    }).fold(0, u64::wrapping_add);
    solana_program::msg!("{}", sum);
    Ok(())
}
The program works on data of arbitrary length, which can be easily observed by executing it with progressively longer buffers. However, due to the aforementioned transaction size limit, the operation eventually fails:
$ chsum-client 2 ab
Program log: 292
$ chsum-client 2 abcdefghijklmnopqrstuvwxyz
Program log: 4264
$ data=…
$ echo "${#data}"
1062
$ chsum-client 2 "$data"
RPC response error -32602: decoded solana_sdk::transaction::versioned::VersionedTransaction too large: 1233 bytes (max: 1232 bytes)
The write-account
program
To solve the problem, the overlarge instruction can be split into smaller chunks which can be sent to the blockchain in separate transactions and stored inside of a Solana account. For this to work two things are needed: i) a smart contract which can receive and concatenate all those chunks; and ii) support in the target smart contract for reading instruction data from an account (rather than from transaction’s payload).
The first requirement is addressed by the write-account
program. It copies bytes from its instruction data into a given account at specified offset. Subsequent calls allow arbitrary (and most importantly arbitrarily-long) data to be written into the account.
RPC Client
The simplest way to send the chunks to the smart contract is to use client library functions packaged alongside the write-account
program. First, add a new dependency to the RPC client:
[dependencies.solana-write-account]
git = "https://github.com/mina86/solana-write-account"
features = ["client"]
And with that, the WriteIter
can be used to split overlong instruction data into chunks and create all the necessary instructions. By default the data
is length-prefixed when it’s written to the account. This simplifies reuse of the account since the length of written data can be decoded without a need to resize the account.
// Write chunks to a new account
let (chunks, write_account, write_account_bump) =
    solana_write_account::instruction::WriteIter::new(
        &WRITE_ACCOUNT_PROGRAM_ID,
        signer.pubkey(),
        WRITE_ACCOUNT_SEED,
        data,
    )?;
for inst in chunks {
    send_and_confirm_instruction(client, signer, inst)?;
}

// Invoke the target smart contract.  write_account is the
// account with the instruction data.  It will need to be
// passed to the smart contract as last account.
todo!("Invoke the target smart contract");

// Optionally, free the account to recover deposit
let inst = solana_write_account::instruction::free(
    WRITE_ACCOUNT_PROGRAM_ID,
    signer.pubkey(),
    Some(write_account),
    WRITE_ACCOUNT_SEED,
    write_account_bump,
)?;
send_and_confirm_instruction(client, signer, inst)?;
The data is copied into a Program Derived Address (PDA) account owned by the write-account
program. The smart contract makes sure that different signers get their own accounts so that they won’t overwrite each other’s work. WRITE_ACCOUNT_SEED
allows a single signer to maintain multiple accounts if necessary.
The address of the account holding the instruction data is saved in the write_account
variable. But before it can be passed to the target smart contract, that contract needs to be altered to support such a calling convention.
Note on parallel execution
With some care, the instructions returned by WriteIter
can be executed in parallel, thus reducing the amount of time spent calling the target smart contract. One complication is that the account may need to be resized when chunks are written into it. Since an account can be grown by only 10 KiB in a single instruction, this becomes an issue when trying to write a chunk which starts more than 10 KiB past the end of the account.
One way to avoid this problem is to group the instructions and execute them ten at a time. Once the first batch executes, the next can be sent to the blockchain. Furthermore, if the account is being reused, it may already be sufficiently large. And of course, this is not an issue when the data doesn’t exceed 10 KiB.
The target smart contract
The Solana runtime has no built-in mechanism for passing instruction data from an account. A smart contract needs to explicitly support such a calling method. One approach is to always read data from an account. This may be appropriate if the smart contract usually deals with overlong payloads. A more flexible approach is to read the instruction from the account only if the instruction data in the transaction is empty. This can be done by defining a custom entry point:
/// Solana smart contract entry point.
///
/// If the instruction data is empty, reads length-prefixed data
/// from the last account and treats it as the instruction data.
///
/// # Safety
///
/// Must be called with pointer to properly serialised
/// instruction such as done by the Solana runtime.  See
/// [`solana_program::entrypoint::deserialize`].
#[no_mangle]
pub unsafe extern "C" fn entrypoint(input: *mut u8) -> u64 {
    // SAFETY: Guaranteed by the caller.
    let (prog_id, mut accounts, mut instruction_data) =
        unsafe { solana_program::entrypoint::deserialize(input) };

    // If instruction data is empty, the actual instruction data
    // comes from the last account passed in the call.
    if instruction_data.is_empty() {
        match get_ix_data(&mut accounts) {
            Ok(data) => instruction_data = data,
            Err(err) => return err.into(),
        }
    }

    // Process the instruction.
    process_instruction(
        prog_id,
        &accounts,
        instruction_data,
    ).map_or_else(
        |error| error.into(),
        |()| solana_program::entrypoint::SUCCESS,
    )
}

/// Interprets data in the last account as instruction data.
fn get_ix_data<'a>(
    accounts: &mut Vec<AccountInfo<'a>>,
) -> Result<&'a [u8], ProgramError> {
    let account = accounts.pop()
        .ok_or(ProgramError::NotEnoughAccountKeys)?;
    let data = alloc::rc::Rc::try_unwrap(account.data)
        .ok().unwrap().into_inner();
    if data.len() < 4 {
        return Err(ProgramError::InvalidInstructionData);
    }
    let (len, data) = data.split_at(4);
    let len = u32::from_le_bytes(len.try_into().unwrap());
    data.get(..(len as usize))
        .ok_or(ProgramError::InvalidInstructionData)
}

solana_program::custom_heap_default!();
solana_program::custom_panic_default!();
The solana-write-account
crate packages all of that code. Rather than copying the above, a smart contract wanting to accept instruction data from an account can add the necessary dependency (this time with the lib
Cargo feature enabled):
[dependencies.solana-write-account]
git = "https://github.com/mina86/solana-write-account"
features = ["lib"]
and use entrypoint
macro defined there (in place of solana_program::entrypoint
macro):
solana_write_account::entrypoint!(process_instruction);
Anchor framework
This gets slightly more complicated for anyone using the Anchor framework. The framework provides abstractions which are hard to break through when necessary. Any Anchor program has to use the anchor_lang::program
macro which, among other things, defines the entrypoint
function. This leads to conflicts when a smart contract wants to define its own entry point.
Unfortunately, there’s no reliable way to tell Anchor not to introduce that function. To add write-account
support to the Solana IBC bridge I had to fork Anchor and extend it with the following change which introduces support for a new custom-entrypoint
Cargo feature:
diff --git a/lang/syn/src/codegen/program/entry.rs b/lang/syn/src/codegen/program/entry.rs
index 4b04da23..093b1813 100644
--- a/lang/syn/src/codegen/program/entry.rs
+++ b/lang/syn/src/codegen/program/entry.rs
@@ -9,7 +9,7 @@ pub fn generate(program: &Program) -> proc_macro2::TokenStream {
         Err(anchor_lang::error::ErrorCode::InstructionMissing.into())
     });
     quote! {
-        #[cfg(not(feature = "no-entrypoint"))]
+        #[cfg(not(any(feature = "no-entrypoint", feature = "custom-entrypoint")))]
         anchor_lang::solana_program::entrypoint!(entry);

         /// The Anchor codegen exposes a programming model where a user defines
         /// a set of methods inside of a `#[program]` module in a way similar
The feature needs to be enabled in the Cargo.toml
of the Anchor program which wants to take advantage of it:
[features]
default = ["custom-entrypoint"]
custom-entrypoint = []
Conclusion
To achieve sub-second finality, Solana had to introduce significant constraints on the protocol and the smart contract runtime environment. However, with enough ingenuity at least some of those can be worked around.
This article introduces a way to overcome the 1232-byte transaction size limit through the use of a write-account
helper program and library functions available in solana-write-account
repository. The solution is general enough that any Solana program, including those using the Anchor framework, can use it.
PS. Interestingly, the Solana size limit also affects how many signatures can be verified in a single transaction. I discuss that problem and its solution in another article.
Monday, 03 February 2025
Back from FOSDEM
This weekend I visited FOSDEM and the Guix Days in Brussels. On Thursday I went to the Guix Days, a FOSDEM fringe event. There I met many GNU Guix and Spritely Goblins contributors. I have been using the Guix System for many years, and I have plans to run Guix on smartphones. Spritely Goblins looks as interesting to me as GNUnet and Taler, but I have not had a deeper look yet. From my education in computer science, I know both distributed systems and actor models. Goblins can be used with both Guile and Racket. I prefer the Guile variant, since I already know Guile from Guix. We discussed why “guix pull” is so slow. I had also proposed a talk about the Guix System on alternative target architectures (ARM, RISC-V, POWER etc.) including smartphones and my Talos II workstation, currently running Debian GNU/Linux. My second proposal was about Virtual Reality. Unfortunately both proposals were rejected as there were too many good talks.
On Saturday, the first day of FOSDEM, I was in the Android Open Source Project and FOSS on Mobile Devices devrooms. There were many interesting talks, including “Forking Android considered harmful” and “Towards a purely open AOSP: Adding Android-like functionality to AOSP”. In the second half of the day I went to the “FOSS on Mobile Devices” devroom, where I met Caleb Connolly, who works on Qualcomm Snapdragon 845 mainline Linux. I had also met them at the last FrOSCon, where I presented my port of Minetest to the Librem 5, a smartphone running GNU/Linux. The first talk in the FOSS on Mobile Devices devroom was “Mainline vs libhybris: Technicalities, down to the buffer”; I also watched most of the others. I have used libhybris once, as part of Droidian running on my OnePlus 6T, an SDM845 phone. Due to bugs I switched back to Android, and later I found out that my phone was not supported by Droidian anymore. By contrast, Mobian uses the mainline kernel and does not use Android drivers.
On Sunday I started in the JavaScript devroom, then moved to the “Declarative & Minimalistic Computing” devroom, where I met many of those whom I had seen at the Guix Days before FOSDEM. First I listened to the “Minimalist web application deployment with Scheme” talk, then I ate my second vegan burger at FOSDEM (the first one was on Saturday). Unfortunately most of the food offered at FOSDEM is still not vegan. Next I went to the Luanti and FOSS on Mobile Devices booths in the K building and talked with the developers about some of the projects that I am currently working on. After that I went back to where I came from to listen to the “The Whippet Embeddable Garbage Collection Library” talk before the most interesting talks about the GNU Shepherd and Spritely Goblins started. Unfortunately I missed the talks from Jessica Tallon and Christine Lemmer-Webber, as I went to the Robotics and Simulation devroom, where I attended two interesting talks about the “Open 3D Engine” and “Valve’s Lighthouse 2.0 Technology”. While I prefer Godot for VR, I still liked both talks. I am also working on my own VR headset using the “Lighthouse positioning deck” from Bitcraze and a postmarketOS-compatible smartphone, currently running both Android and Mobian.
After those two talks I went to the keynotes at Janson (the big room). Both “The Growing Body of Proprietary Infrastructure for FOSS Development: Repeating Bad History” and “How we are defending Software Freedom against Apple at the EU’s highest court” were pretty interesting talks from Free Software activists whom I fully support. Finally I went to a room near Janson where people were working on Virtual Reality (VR). I did not know about Overte e.V. and their social VR projects yet. I told one of the developers that I use components of V-Sekai in my BeatSaber clone. V-Sekai is more or less a clone of the non-free VRChat game, one of those games that are popular in the transgender community. I am nonbinary and trans myself, and met many other queers at both FOSDEM and the Guix Days. Not fully out yet, I enjoy social VR where I can try out a feminine avatar, while presenting mostly masculine in real life. At FOSDEM I wore a skirt and a dress, makeup and cat ears. Next month I plan to go to the Chemnitzer Linux-Tage conference.
Saturday, 01 February 2025
Dropping Privileges in Go
Computer programs may do lots of things, both intended and unintended. What they can do is limited by their privileges. Since most operating systems execute programs as a certain user, the program has all the user’s precious privileges.
To take a concrete example, if a user has an SSH private key lying around and runs, e.g., a chat program, then this program is able to read the private key even though it has nothing to do with it. Assuming this chat program is exploitable, an attacker might instruct it through a crafted message to exfiltrate the private key.
Maybe not the issue’s core, but the damage is rooted in the fact that a program was able to access a resource that it should not be able to access in the first place. As writing secure software is out of scope here, the private key could have been saved if the principle of least privilege had been enforced by some means. It says, in a nutshell, that each component, i.e., the chat software, should only have the necessary privileges and nothing more. Many roads might lead to this state, e.g., not using the same user for private key interactions and chatting, or sandboxing the chat application.
When developing software, the developer should know what their tool should be able to do. Thus, they are able to carve out the allowed territories, denying everything else with the help of system features. As a metaphor, think of a werewolf chaining themself up before full moon.
In case you are asking yourself right now why you should do this to your code as it will never fail, then you especially should do this. For most applications out there, the question is not if they can be broken, but when they will be broken. Since I have written many bugs throughout the years and have seen exploits I could barely grasp, I am trying to self-chain all my future werewolves.
Changes In Software Architecture
The idea of self-restricting software is that privileges, once given up, cannot be regained. For example, once the program has denied itself file system access, no more files can be opened.
The software starts as a certain user, sometimes the root
user, e.g., to use a restricted network port.
Thus, after starting to listen on this port, this privilege can be dropped, e.g., by switching to an unprivileged user while keeping the file descriptor.
The software continues accepting connections on the previously bound port, but cannot start listening on other restricted ports.
This limitation must be taken into account when planning the software. Instead of being able to access all resources whenever necessary, they must be acquired in the beginning, before self-restricting. To introduce another questionable metaphor, think about a funnel or an upside-down cone: your program starts with all these privileges, dropping them along the way until it continues with just the bare minimum.
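To make the funnel concrete before diving into the individual mechanisms, here is a minimal sketch of my own (not from this post's example repository) showing the order of operations: acquire first, drop afterwards. It assumes the process starts as root and that an unprivileged user with UID 992 and GID 987 exists; both numbers are made up, and the sections below show how to look them up properly and what the individual calls do.
package main

import (
    "log"
    "net"
    "net/http"

    "golang.org/x/sys/unix"
)

func main() {
    // Acquire the privileged resource first: bind port 80 while still root.
    ln, err := net.Listen("tcp", ":80")
    if err != nil {
        log.Fatalf("listen: %v", err)
    }

    // Narrow the funnel: switch to the unprivileged user and group.
    // The already obtained listener keeps working afterwards.
    uid, gid := 992, 987
    if err := unix.Setgroups([]int{gid}); err != nil {
        log.Fatalf("setgroups: %v", err)
    }
    if err := unix.Setresgid(gid, gid, gid); err != nil {
        log.Fatalf("setresgid: %v", err)
    }
    if err := unix.Setresuid(uid, uid, uid); err != nil {
        log.Fatalf("setresuid: %v", err)
    }

    // Serve on the previously bound socket; binding another
    // restricted port from here on would fail.
    log.Fatal(http.Serve(ln, http.NotFoundHandler()))
}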
Good Old Chroot And User-Switch
Let’s start with the classic approach of chrooting and user/group changing. I call it classic because this variant goes back to the early 1990s and works on any POSIX-like operating system (think BSDs, Linux and friends).
Unfortunately, this approach has the annoyance that the necessary system calls are reserved for the root
user.
While in most cases this is not a problem for daemons, this would be a blocker for end user applications, like GUIs.
Instead of suggesting some SUID
file flag madness, secure alternatives for rootless scenarios will follow.
chroot(2)
First things first: chroot(2)
changes the process’ root directory to the supplied one.
For example, /
becomes /var/empty
and accessing /etc/passwd
would actually try to open /var/empty/etc/passwd
.
To activate the chroot(2)
, the process needs to chdir(2)
into it.
Unless further actions are taken, an attacker can break out of a chroot. Actually, chrooting is not a security feature, but can be used to build one - as this post attempts. Nevertheless, please be aware of the limitations.
A chroot directly impacts the process.
If it should not interact with any files, chrooting to /var/empty
or some just created empty directory makes sense.
If all required files live in one directory, one can consider chrooting to that directory.
If, however, files from different locations must be accessed, a strict chroot
can be a burden.
This is a decision one has to take on a case by case basis.
After importing golang.org/x/sys/unix
, the following snippet is enough to chroot(2)
the process to /var/empty
, which is a “[g]eneric chroot(2) directory”, according to OpenBSD’s hier(7)
.
if err := unix.Chroot("/var/empty"); err != nil {
log.Fatalf("chroot: %v", err)
}
if err := unix.Chdir("/"); err != nil {
log.Fatalf("chdir: %v", err)
}
setuid(2)
or setresuid(2)
The process may now be chrooted to an empty directory, but otherwise still runs as root
.
To give up root
privileges, let the process switch user rights to an unprivileged user without qualities.
Throughout the ages, multiple syscalls for user switching have emerged, starting at setuid(2)
.
While not strictly being part of POSIX, the setresuid(2)
syscall is available on most operating systems.
It allows setting the real, effective and saved user ID, between which there are subtle differences.
These may differ when developing or using a SUID
application, having the real user ID of your user and the effective user ID of the root
user.
In this case, however, we just want to drop all privileges to our unqualified user, setting all three user IDs to the same user.
The same applies to groups with setresgid(2)
.
In addition, as a process may have multiple groups, this list can be shortened via setgroups(2)
.
Doing so results in only the privileges of the given groups applying, not those of all the user’s other groups.
For the example, create an unprivileged worker user. On Linux, this can be done as follows:
$ sudo useradd \
--home-dir /var/empty \
--system \
--shell /run/current-system/sw/bin/nologin \
--user-group \
demoworker
$ id demoworker
uid=992(demoworker) gid=987(demoworker) groups=987(demoworker)
Continue with the following short code block:
// Prior chroot code
uid, gid := 992, 987
if err := unix.Setgroups([]int{gid}); err != nil {
log.Fatalf("setgroups: %v", err)
}
if err := unix.Setresgid(gid, gid, gid); err != nil {
log.Fatalf("setresgid: %v", err)
}
if err := unix.Setresuid(uid, uid, uid); err != nil {
log.Fatalf("setresuid: %v", err)
}
While this snippet works as a demonstration, having to hard-code the user and group ID is a bit too much.
Thus, let the code do the lookup by writing a short helper function around os/user
.
One word of caution regarding the os/user
package:
It caches the current user and one cannot simply invalidate this cache.
Thus, after using the following helper function user.Current()
will always return whichever user executed it first.
// uidGidForUserGroup fetches an UID and GID for the given user and group.
func uidGidForUserGroup(username, groupname string) (uid, gid int, err error) {
userStruct, err := user.Lookup(username)
if err != nil {
return
}
userId, err := strconv.ParseInt(userStruct.Uid, 10, 64)
if err != nil {
return
}
groupStruct, err := user.LookupGroup(groupname)
if err != nil {
return
}
groupId, err := strconv.ParseInt(groupStruct.Gid, 10, 64)
if err != nil {
return
}
uid, gid = int(userId), int(groupId)
return
}
When using this function, a word of caution is necessary as this might be the first situation where the chroot can shoot us in the foot.
The lookup requires the /etc/passwd
and /etc/group
files to be accessible.
Thus, when already chrooted to /var/empty
, this will obviously fail:
open /etc/passwd: no such file or directory
First, do the lookup, then chroot(2)
, and finally perform the user/group switching.
// Start with root privileges, do necessary lookups.
uid, gid, err := uidGidForUserGroup("demoworker", "demoworker")
if err != nil {
log.Fatalf("user/group lookup: %v", err)
}
// Drop into chroot
if err := unix.Chroot("/var/empty"); err != nil {
log.Fatalf("chroot: %v", err)
}
if err := unix.Chdir("/"); err != nil {
log.Fatalf("chdir: %v", err)
}
// Switch to an unprivileged user unable to escape chroot
if err := unix.Setgroups([]int{gid}); err != nil {
log.Fatalf("setgroups: %v", err)
}
if err := unix.Setresgid(gid, gid, gid); err != nil {
log.Fatalf("setresgid: %v", err)
}
if err := unix.Setresuid(uid, uid, uid); err != nil {
log.Fatalf("setresuid: %v", err)
}
// Application code follows
Limiting Resources The POSIX Way
After chrooting and dropping root
user privileges, the code may no longer access privileged system APIs or some files, but it can still burn CPU cycles.
For example, a parser may be vulnerable to a billion laughs attack, resulting in either 100% CPU load or even memory exhaustion.
One good old POSIX way to limit different kinds of resources is setrlimit(2)
.
Depending on the target operating system, different resources are defined.
Both horror scenarios from the example can be addressed either via RLIMIT_CPU
for CPU time or via RLIMIT_DATA
for data.
RLIMIT_CPU
The CPU time or process time is the amount of CPU cycles actively consumed by a single process. If the process calculates something nonstop, this counter rises. However, if the process waits for certain events, the counter idles as well.
As an example, idle for a bit and then burn CPU cycles with useless hash computations while limiting the CPU time to one second.
if err := unix.Setrlimit(
unix.RLIMIT_CPU,
&unix.Rlimit{Max: 1},
); err != nil {
log.Fatal("setrlimit: %v", err)
}
log.Println("CPU time != execution time, hanging low")
time.Sleep(5 * time.Second)
log.Println("STRESS!")
buff := make([]byte, 32)
for i := uint64(1); ; i++ {
_, _ = rand.Read(buff)
_ = sha256.Sum256(buff)
if i%100_000 == 0 {
log.Printf("Run %d", i)
}
}
Putting this example into a main
function shows how the process gets aborted after consuming too much CPU time.
2025/01/26 20:17:18 CPU time != execution time, hanging low
2025/01/26 20:17:23 STRESS!
2025/01/26 20:17:23 Run 100000
[ . . . ]
2025/01/26 20:17:24 Run 800000
[1] 190379 killed ./02-01-setrlimit-cpu
RLIMIT_DATA
This second example limits the maximum data segment size to 10 MiB or, in other words, restricts the amount of available memory to 10 MiB.
The code will allocate memory within an infinite loop, resulting in an out-of-memory situation.
However, due to the setrlimit(2)
call, the process gets aborted.
if err := unix.Setrlimit(
unix.RLIMIT_DATA,
&unix.Rlimit{Max: 10 * 1024 * 1024},
); err != nil {
log.Fatal("setrlimit: %v", err)
}
var blobs [][]byte
for i := uint64(1); ; i++ {
buff := make([]byte, 1024)
_, _ = rand.Read(buff)
blobs = append(blobs, buff)
if i%1_000 == 0 {
log.Printf("Allocated %dK", i)
}
}
And this will be aborted due to its memory hunger.
2025/01/26 20:17:44 Allocated 1000K
2025/01/26 20:17:44 Allocated 2000K
fatal error: runtime: out of memory
[ . . . ]
Good Hard Limits?
These two examples set hard limits, resulting in eventually aborting the process.
Especially RLIMIT_CPU
, being a monotonically increasing counter, will eventually be reached.
Thus, what are good values? Depends, of course.
Just to be sure: if deciding to set any limits, make them high enough not to be reached during normal operation. If something goes south, they are still there as a safety net.
And what about soft limits? That is an exercise left for the reader.
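As a possible starting point for that exercise, here is a sketch of my own (not from the post's example repository): it sets a soft limit below the hard limit and catches the resulting SIGXCPU to shut down gracefully instead of being killed outright. The numbers are made up; it assumes the log, os, os/signal and golang.org/x/sys/unix packages are imported.
// Soft limit of 2 seconds CPU time, hard limit of 5 seconds.
// Exceeding the soft limit delivers SIGXCPU; exceeding the hard
// limit kills the process.
if err := unix.Setrlimit(
    unix.RLIMIT_CPU,
    &unix.Rlimit{Cur: 2, Max: 5},
); err != nil {
    log.Fatalf("setrlimit: %v", err)
}

sigs := make(chan os.Signal, 1)
signal.Notify(sigs, unix.SIGXCPU)
go func() {
    <-sigs
    log.Println("soft CPU limit reached, shutting down")
    os.Exit(1)
}()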
Doubling Down With OS-specific Features
Everything so far should work on most POSIX-like operating systems. As an advantage, using these patterns in your program may work on platforms you did not even know existed.
But there are also OS-specific mechanisms allowing to drop privileges, even when not starting as the root
but as a regular user.
Since my personal experience is limited to Linux and OpenBSD, I will address some of their features.
With OpenBSD having simpler APIs to use, I will start there.
Restricting Syscalls On OpenBSD
The operating system’s kernel verifies if the user privileges are sufficient to access some resource.
For example, when trying to open(2)
a file, the operating system may deny this.
This check happens within open(2)
.
But what if the program itself cannot even use open(2)
because the developer knows that it must never open a file in the first place?
Welcome to system call filtering, allowing a program to restrict what syscalls are being used later on.
If there is a thing like having a favorite system call, mine might be OpenBSD’s pledge(2)
.
It provides a simple string-based API to restrict the available system calls based on space-separated keywords, called promises.
These promises are names of syscall groups, e.g., rpath
for read-only system calls regarding the file system.
By adding exec
to the list, executing other programs will be allowed, starting them with their own promise, given in the second parameter.
int pledge(const char *promises, const char *execpromises);
After a pledge(2)
was made, it cannot be undone, only tightened.
Tightening means calling pledge(2)
another time with a shorter promises list.
In case of violating the syscall promise, the process gets either killed or, if error
is part of the promise, the denied system call returns an error.
As being a system call, it is available for Go in golang.org/x/sys/unix
.
Unfortunately, the web Go Packages thingy only renders the docs for some selected platforms, not including OpenBSD.
Thus, I took the liberty to paste the docs below.
By the way, setting the GOOS
or GOARCH
environment variable also works for go doc
, e.g., GOOS=openbsd go doc -all golang.org/x/sys/unix
works on a Linux.
func Pledge(promises, execpromises string) error
Pledge implements the pledge syscall.
This changes both the promises and execpromises; use PledgePromises or
PledgeExecpromises to only change the promises or execpromises respectively.
For more information see pledge(2).
func PledgeExecpromises(execpromises string) error
PledgeExecpromises implements the pledge syscall.
This changes the execpromises and leaves the promises untouched.
For more information see pledge(2).
func PledgePromises(promises string) error
PledgePromises implements the pledge syscall.
This changes the promises and leaves the execpromises untouched.
For more information see pledge(2).
Now, there are three Go functions for pledge(2)
: One to set both parameters and one to set only the first or second one.
Let’s create an example of a simple program working with an input file, making a promise just allowing it to read the file and making another, tighter one after the file was read.
Use your imagination as to what the program should do, e.g., it could convert an image to another format and print it out to stdout.
// Start with limited privileges
if err := unix.PledgePromises("stdio rpath error"); err != nil {
log.Fatalf("pledge: %v", err)
}
// Read input file
f, err := os.Open("input")
if err != nil {
log.Fatalf("cannot open input: %v", err)
}
inputFile, err := io.ReadAll(f)
if err != nil {
log.Fatalf("cannot read input: %v", err)
}
if err := f.Close(); err != nil {
log.Fatalf("cannot close input: %v", err)
}
// Drop further, reading files is no longer necessary
if err := unix.PledgePromises("stdio error"); err != nil {
log.Fatalf("pledge: %v", err)
}
// Do some computation based on the input
As the example shows, using pledge(2)
is both easy and boring.
That might be the reason why most programs shipped with OpenBSD are pledged and there are lots of patches for ported software.
Only one call results in dropping so many privileges.
Impressive.
Restricting File System Access On OpenBSD
This post opened with the constructed example of a pwned chat program used to exfiltrate the user’s SSH private key.
What if a program could make a promise about which file system paths are needed, denying every other access?
OpenBSD’s unveil(2)
addresses this, similar as pledge(2)
does for system calls.
Multiple unveil(2)
calls create an allow-list of unveiled paths the program is allowed to access.
Each call adds a path and the kind of permission: read, write, exec and/or create.
After a finalizing call with two empty parameters, this will be enforced.
Thus, if the chat program used unveil(2)
for relevant directories - definitely not containing ~/.ssh
-, this exploit would have been mitigated.
int unveil(const char *path, const char *permissions);
This system call is available for Go in golang.org/x/sys/unix
as well.
func Unveil(path string, flags string) error
Unveil implements the unveil syscall. For more information see unveil(2).
Note that the special case of blocking further unveil calls is handled by
UnveilBlock.
func UnveilBlock() error
UnveilBlock blocks future unveil calls. For more information see unveil(2).
Thus, in Go multiple unix.Unveil(...)
calls might be issued with a closing unix.UnveilBlock()
.
Continuing with the chat program example, a program can be written to allow read/write/create access to some Download directory to store cringy memes. All other file system requests should be denied.
// Restrict read/write/create file system access to the ./Download directory.
// This does not include exec!
if err := unix.Unveil("Download", "rwc"); err != nil {
log.Fatalf("unveil: %v", err)
}
if err := unix.UnveilBlock(); err != nil {
log.Fatalf("unveil: %v", err)
}
// Buggy application starts here: allowing path traversal
userInput := "../.ssh/id_ed25519"
f, err := os.Open("Download/" + userInput)
if err != nil {
log.Fatalf("cannot open file: %v", err)
}
defer f.Close()
privateKey, err := io.ReadAll(f)
if err != nil {
log.Fatalf("cannot read: %v", err)
}
log.Printf("looks familiar?\n%s", privateKey)
Taking this code for a test drive shows how a good old path traversal attack was mitigated, even if the file exists.
$ ./04-openbsd-unveil
2025/01/26 22:11:57 cannot open file: open Download/../.ssh/id_ed25519: no such file or directory
$ ls -l Download/../.ssh/id_ed25519
-rw------- 1 user user 420 Jan 26 22:10 Download/../.ssh/id_ed25519
Great success!
Restricting Syscalls On Linux
Let’s switch operating systems and focus on the Linux kernel for a while.
The similarly named section about system call filtering on OpenBSD started with a few sentences about how system calls are the gateway between programs and the kernel for accessing resources. The same applies to Linux, and to almost any operating system out there. Denying unnecessary syscalls directly implies restricting privileges.
Linux comes with a very powerful tool, Seccomp BPF, allowing each process to supply a filter program to the kernel which decides which system calls are allowed. This filter is a Berkeley Packet Filter (BPF) program, receiving both the syscall number and (some) arguments. Thus, it is even possible to allow or deny a syscall only for certain argument values.
Obviously, this great flexibility has its ups and downs. While there are situations where one might want to create a very specific filter, a quick-and-dirty one might be more common. At least in my experience as an unwashed userland developer, I mostly prefer a rougher filter, especially in Go, where one mostly does not interact with system calls directly.
But where to start?
When wanting to get the pure Seccomp BPF experience in Go, there is the github.com/elastic/go-seccomp-bpf
package, written in pure Go without any cgo.
It allows developing fine-grained filters on a system call basis, as introduced above.
When I first used it, I had no real clue where to start with writing a filter for my Go program.
Starting with an all-denying filter and using Linux’ auditd(8),
I made progress in hopefully collecting all necessary syscalls.
But then I had to realize that updating Go or any of my dependencies may result in other code (d’oh!) and therefore other system calls.
Another constraint is that there are slight differences between the available system calls on different architectures.
Thus, I started to play around with a wider filter list and eventually “borrowed” the sets available in systemd’s SystemCallFilter
.
Administrators may know this feature already; it allows restricting the available system calls for each service under systemd’s control through a list of system calls, quite similar to pledge(2)
.
After a while I built myself a small library exactly for this use case, github.com/oxzi/syscallset-go
.
From a developer’s perspective, it offers a simple string-based API to build yourself a list of allowed system calls through groups.
Honestly, I just ported systemd’s syscall sets to Go and made them fly using the aforesaid go-seccomp-bpf
library.
But it works.
// Start with limited privileges
if err := syscallset.LimitTo("@system-service"); err != nil {
log.Fatalf("seccomp: %v", err)
}
// Read input file
f, err := os.Open("input")
if err != nil {
log.Fatalf("cannot open input: %v", err)
}
inputFile, err := io.ReadAll(f)
if err != nil {
log.Fatalf("cannot read input: %v", err)
}
if err := f.Close(); err != nil {
log.Fatalf("cannot close input: %v", err)
}
// Drop further, reading files is no longer necessary
if err := syscallset.LimitTo("@basic-io"); err != nil {
log.Fatalf("seccomp: %v", err)
}
// Do some computation based on the input
The reader who has not dozed off might recognize this code, as it is almost identical to the demo code used for pledge(2)
above.
However, there are a few small differences.
First and most obviously, the filters differ.
OpenBSD also has no meta-filters like the @system-service
used here, which contains lots of commonly used system calls.
Furthermore, OpenBSD’s pledge(2)
had the handy error
group, allowing forbidden system calls to fail, when set.
Otherwise, the kernel would kill the process.
This behavior also applies here, where each misstep will directly be punished by killing the process.
Restricting File System Access On Linux… And Network Access As Well
For symmetry’s sake, Linux’ answer to unveil(2)
must follow.
And it does, but it is also way more.
Please welcome Landlock LSM.
While Landlock started to address the same issues as unveil(2)
- limiting file system access -, it recently grew to also allow certain network isolation.
But let the code speak for us, using the github.com/landlock-lsm/go-landlock
library.
Restricting Paths
// Restrict file system access to the ./Download directory.
if err := landlock.V5.BestEffort().RestrictPaths(
landlock.RWDirs("Download"),
); err != nil {
log.Fatalf("landlock: %v", err)
}
// Buggy application starts here: allowing path traversal
userInput := "../.ssh/id_ed25519"
f, err := os.Open("Download/" + userInput)
if err != nil {
log.Fatalf("cannot open file: %v", err)
}
defer f.Close()
privateKey, err := io.ReadAll(f)
if err != nil {
log.Fatalf("cannot read: %v", err)
}
log.Printf("looks familiar?\n%s", privateKey)
This code might also look familiar, because it is an adapted version of the earlier unveil(2)
example.
One thing worth mentioning might be the V5.BestEffort()
part.
Landlock itself is versioned, growing in features with new Linux releases.
But to build a Go program compatible with older targets, the BestEffort
part falls back to what the targeted kernel supports.
In case this is not desired, directly use V5.RestrictPaths
- or whatever version is the latest when you read this.
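As a sketch of my own (not from the post's example repository), the strict variant would look like this; on a kernel that does not fully support the requested ABI version it returns an error instead of silently degrading. The restricted directories are made up for illustration.
// Require Landlock ABI v5 exactly; no best-effort fallback.
if err := landlock.V5.RestrictPaths(
    landlock.RODirs("/etc"),
    landlock.RWDirs("Download"),
); err != nil {
    log.Fatalf("landlock (strict): %v", err)
}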
Restricting Network
At the moment, the possibilities to restrict network connections are still a bit limited in Landlock, but in return the API is simple. In case you are looking for a full-fledged network limitation suite on Linux, maybe take a look at cgroup eBPF-based network filtering.
So, what is definitely possible? The application is able to limit both inbound and outbound TCP traffic based on ports. Or, in simpler words, one can allow certain TCP ports.
// Restrict outbound TCP connections to port 443.
if err := landlock.V5.BestEffort().RestrictNet(
landlock.ConnectTCP(443),
); err != nil {
log.Fatalf("landlock: %v", err)
}
// HTTP should fail, while HTTPS should work.
for _, proto := range []string{"http", "https"} {
_, err := http.Get(proto + "://pkg.go.dev/")
log.Printf("%q worked: %t\t%v", proto, err == nil, err)
}
This little example only allows outgoing TCP connections to port 443, as the failing HTTP (port 80) connection shows. Of course, this is no secure way to restrict your application to only use HTTPS.
./06-02-linux-landlock-tcp
2025/01/27 21:35:32 "http" worked: false Get "http://pkg.go.dev/": dial tcp [2600:1901:0:f535::]:80: connect: permission denied
2025/01/27 21:35:32 "https" worked: true <nil>
And finally, there is also landlock.BindTCP
as a rule option, restricting the TCP ports to be bound to.
This may be especially useful when fearing that an attacker launches a shell listening on some port.
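To illustrate with a sketch of my own (not from the post's examples; the port numbers are made up), binding can be limited to a single port like this; it assumes the log, net and go-landlock packages are imported.
// Only allow binding TCP port 8080; any other bind is denied.
if err := landlock.V5.BestEffort().RestrictNet(
    landlock.BindTCP(8080),
); err != nil {
    log.Fatalf("landlock: %v", err)
}

// Binding the allowed port works ...
if _, err := net.Listen("tcp", ":8080"); err != nil {
    log.Fatalf("listen 8080: %v", err)
}
// ... while any other port fails with a permission error.
if _, err := net.Listen("tcp", ":9090"); err != nil {
    log.Printf("listen 9090 failed as expected: %v", err)
}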
Is This All? Are We Finished?
Have I now covered all possible options for an application to self-restrict its privileges?
Obviously not.
I have not even started covering Linux’ cgroups, itself allowing a wide range of restrictions.
And then there are even other operating systems, like FreeBSD with its capsicum(4)
.
My main goal for this post was to show that there are quite simple APIs for restricting privileges. OpenBSD comes with simple APIs out of the box, and for Linux the two Go libraries shown make these big, very configurable features usable as one-liners.
Then there is setrlimit(2)
, which one might wanna use or might wanna ignore.
And of course the root
-restricted chroot(2)
/setresuid(2)
dance.
Thus, you as a developer have a choice. As shown, it is quite easy to add some of the introduced mechanisms to protect your software against future mischief. I would urge you to give it a try, limiting the attack surface of your program.
The examples used in this post and more are available in the following git repository, https://codeberg.org/oxzi/go-privsep-showcase. May it be useful.
Thursday, 30 January 2025
AI images should be copyrightable
In September 2022, ‘Théâtre D’opéra Spatial’, a work submitted by Jason Allen, won Colorado State Fair’s annual fine art competition in the digital art category. What made the success noteworthy was that the image had been AI-generated. Mr Allen eventually tried to register the work with the US Copyright Office but his attempts turned out fruitless. In September 2023 the Office refused his registration.


I didn’t think much of it at the time. I wasn’t that invested in the consideration of what kind of ‘two-dimensional artworks’ are protected by copyright and, more notably, I somewhat agreed with the decision. Perhaps the prompt was protected, but if only minor manual edits were made to the image, it felt like a stretch to say the image as a whole could be covered by copyright law.
An entrance to paradise
‘Théâtre D’opéra Spatial’ wasn’t the first time US Copyright Office dealt with computer-generated images. Back in 2016, Steven Thaler’s Creativity Machine created a series of images with one of them — ‘A Recent Entrance to Paradise’ — ending up under the Office’s deliberation. The result was the same: in February 2022, the Office refused the registration.1 I was made aware of the image only after Leonard French, a US copyright attorney, made a video about the case. In it, he made the following comparison to photography:
Maybe [Thaler] could have argued that he had to select what images got put into the AI to create the output. Maybe he can argue that there was some creativity in selecting the images and choosing the setting for the AI very much like we have protections for photographs because there is at least minimal human creativity in choosing what to photograph, with what lens, what focal length, what focus, what iris or aperture, what shutter speed, and then what post-processing settings to use. That’s enough. It’s maybe not that much, but it’s enough to reach copyright protection.
This made me wonder what actually is the minimal human creativity that’s required for a work to be subject to copyright.
The lazy tourist

Specifically, consider a lazy tourist. In his travels, he ends up on a beach and notices a flock of birds in the water. He decides to take a photo of them with his mobile phone. A marvellous device makes taking photos a seamless experience — no need to pick any settings, the phone automatically chooses all of them. The tourist is a lazy one so he’s not going to walk too far to find a perfect framing. He wanders a few steps, points his phone and creates a new work covered by copyright.
How much human creativity was there in this process? The subject of the photo wasn’t carefully planned. The tourist didn’t wake up one day deciding to make a photo of a flock of birds in the water. He just stumbled upon them. The choice of the subject was highly constrained, without room for much creativity.
Similarly, the tourist chose the framing but here again the options were limited. He didn’t use a drone to make an aerial shot, wait patiently for a perfect composition, or look for interesting framing with other elements in the frame. The whole process took him just a couple of minutes.
And with the amount of automation in mobile phones, the tourist had no creative input into the camera settings or post-processing steps. And despite all that, the photo is covered by copyright.
The prompter
Compare that to someone entering a prompt into a text-to-image tool such as Stable Diffusion. There are countless possibilities of subjects to create images of. A flock of birds in the sea? Why not a weyr of dragons in space? Or giant frogs on stilts devouring humans? Similarly there are many options for framing. Is the subject of the image seen from the ground? From above? Are there foreground elements which obscure the subject?
And on top of that come all the settings that need to be tweaked to produce a half-decent AI-generated image: multiple models to choose from, samplers, guidance scale, control nets and so on.
Conclusion
AI is a controversial topic; there are many aspects of it that can lead to heated discussions. It’s not clear whether using publicly-available data to train the models is fair use or not. It’s also not clear whether a model which can reproduce letter-for-letter copyrighted works isn’t infringing in the first place.
Similarly copyrightability issue is hardly settled. The registration request for ‘A Recent Entrance to Paradise’ doesn’t tell us much since it claimed computer as the author. And in case of ‘Théâtre D’opéra Spatial’, the US Copyright Office did ‘not foreclose Mr Allen’s ability to file a new application for registration of the Work in which he disclaims the Work’s AI-generated material. In such a case, the Office could consider whether the human-authored elements of the Work can sustain a claim for copyright, an issue we have not decided here.’ Similarly, Leonard French theorised in his video that ‘maybe there is a point where an AI generated work could be protectable if there was enough creative human input on the input side or if you did that on the output side and created something with the AI-generated work.’
We’re many lawsuits away from having a clear picture of what kind of AI-generated works will be covered by copyright, but the position that photos made with a phone can be protected while images created with generative models cannot is inconsistent. The latter entails a lot more creativity and skill than the majority of copyrighted photographs taken each day around the world.
Sunday, 26 January 2025
Rust’s worst feature*
* available in Rust nightly.
There are several aspects of Rust that I’m not particularly fond of but the one that takes the cake is core::io::BorrowedBuf
which I despise with a passion. It’s a nightly feature, which puts my extreme emotions about it in question. On the other hand it means there’s still time to stop it from getting stabilised and figure out something better.
In this article I’ll describe the problem the feature addresses, the issues I have with the solution and some alternatives. As it turns out, things aren’t as easy as they seem at first glance.
The problem
Consider the slow_copy
routine below which copies data between two I/O streams. On each iteration of the loop, it zero-initialises the buffer which wastes time considering that read
will overwrite the data. The compiler doesn’t know that and has no choice but to fill the array with zeros each time. Even the obvious optimisation of moving the buffer declaration outside of the loop isn’t available to the compiler.
fn slow_copy(
    mut rd: impl std::io::Read,
    mut wr: impl std::io::Write,
) -> std::io::Result<()> {
    loop {
        let mut buf = [0; 4096];
        let read = rd.read(&mut buf)?;
        if read == 0 {
            break Ok(());
        }
        wr.write_all(&buf[..read])?;
    }
}
An attempt at a solution is to use MaybeUninit
which makes it possible to declare a region of uninitialised memory. Some explicit pointer casting is necessary, but otherwise code using it is straightforward.
use core::mem::MaybeUninit;

pub fn unsound_copy(
    mut rd: impl std::io::Read,
    mut wr: impl std::io::Write,
) -> std::io::Result<()> {
    loop {
        let mut buf = [MaybeUninit::<u8>::uninit(); 4096];
        // SAFETY: This is unsound.
        // For demonstration purposes only.
        let buf = unsafe {
            &mut *(&mut buf as *mut [_] as *mut [u8])
        };
        let read = rd.read(buf)?;
        if read == 0 {
            break Ok(());
        }
        wr.write_all(&buf[..read])?;
    }
}
While replacing the array of zeros with an array of uninitialised values may work in specific circumstances, the code is unsound. A change to the compiler or its options, modification of unrelated parts of the code, or using the function with a different Read
trait implementation may break the program in unpredictable ways.
The BorrowedBuf
The solution in nightly Rust is the BorrowedBuf
struct. It’s a byte slice which remembers how much of it has been initialised. It doesn’t own the memory and operates on a borrowed buffer (hence the name). It can point at an array on the stack or a slice living on the heap (such as Vec
’s spare capacity). A naïve use of the feature is the following:
#![feature(core_io_borrowed_buf, read_buf)]
use core::io::BorrowedBuf;
use core::mem::MaybeUninit;

fn almost_good_copy(
    mut rd: impl std::io::Read,
    mut wr: impl std::io::Write,
) -> std::io::Result<()> {
    loop {
        let mut buf = [MaybeUninit::uninit(); 4096];
        let mut buf = BorrowedBuf::from(&mut buf[..]);
        rd.read_buf(buf.unfilled())?;
        if buf.len() == 0 {
            break Ok(());
        }
        wr.write_all(buf.filled())?;
    }
}
Issues with the BorrowedBuf
While almost_good_copy
appears to work as expected, BorrowedBuf
isn’t without its share of problems. I describe them below.
Optionality
The BorrowedBuf
does not seamlessly integrate with existing Rust code. In fact quite the opposite. APIs need to support it explicitly. For example, many third-party Read
implementations do not provide read_buf
method. In its absence, the default version initialises the memory and calls read
negating any potential benefits of BorrowedBuf
.
Similarly, functions which take output slice as an argument — such as rand
’s RngCore::fill_bytes
— could benefit from being able to write to uninitialised memory. However, to offer that benefit, they need to be changed to support BorrowedBuf
. A motivated programmer can try adding necessary support to actively maintained packages, like rand
, but what if one is stuck on an older version of the crate or deals with apparently abandoned crates like hex
or base64
? To support those cases, forking would be necessary, leading the programmer towards deeper circles of dependency hell.
Ergonomics
Then again, should functions such as fill_bytes
integrate with BorrowedBuf
in the first place instead of taking an &mut [MaybeUninit<u8>]
argument? The issue with the latter is that there’s no safe way to convert &mut [T]
into &mut [MaybeUninit<T>]
.1 As such, users who so far happily used such functions with regular initialised buffers would need a convoluted incantation to make their previously straightforward code compile. Meanwhile, creating a BorrowedBuf
is somewhat convenient and can be done from initialised and uninitialised buffers alike.
Lack of generality
In addition to RngCore::fill_bytes
, the rand
crate offers a Rng::fill
method which fills a generic slice of integers with random data. It could easily work with BorrowedBuf
except that the struct works on u8
slices only. As a result, a crate which deals with different types cannot consistently use BorrowedBuf
.
I don’t know the reasons why BorrowedBuf
is not generic. It’s possible that its design focused on addressing the Read
trait use case only. Complications around dealing with Drop
types could have been a contributing factor. However, even then the type could be generic on Copy
types.
Ease of misuse
Read::read_buf
being optional brings another problem. Without full understanding of the behaviour and interactions of the BorrowedBuf
type, it’s easy to misuse it such as in almost_good_copy
. One can be excused for assuming that the function eliminates unnecessary initialisation. It declares an uninitialised region, wraps it in BorrowedBuf
and reads data into it. Even inspection of the assembly output shows lack of the memset
call.
Alas, while almost_good_copy
avoids memory initialisation when reading data from a File
, it wastes time zeroing the buffer when, for example, decompressing data with the help of the flate2
crate (which does not offer a custom read_buf
method), effectively becoming a slow_copy
.
Unless the underlying type is known, the programmer must assume that read_buf
may resort to filling the memory. The proper use of BorrowedBuf
is to construct it only once so that it can remember that the memory has been initialised.
#![feature(core_io_borrowed_buf, read_buf)]
use core::io::BorrowedBuf;
use core::mem::MaybeUninit;

fn copy(
    mut rd: impl std::io::Read,
    mut wr: impl std::io::Write,
) -> std::io::Result<()> {
    let mut buf = [MaybeUninit::uninit(); 4096];
    let mut buf = BorrowedBuf::from(&mut buf[..]);
    loop {
        buf.clear();
        rd.read_buf(buf.unfilled())?;
        if buf.len() == 0 {
            break Ok(());
        }
        wr.write_all(buf.filled())?;
    }
}
Complexity
With BorrowedBuf
’s complexity it’s not hard to imagine why people might use it in an inefficient way. The struct is harder to understand than the unsound casting in unsound_copy
. This may lead people to use the more straightforward option even if it’s not correct. An analogy to a Vec<u8>
with its contents and spare capacity partially helps — a BorrowedBuf
has analogous filled and unfilled parts — but is an oversimplified view. A BorrowedBuf
is also split into initialised and uninitialised parts. The documentation visualises it as follows:
|                 Capacity                  |
| Filled |             Unfilled             |
|        Initialised        | Uninitialised |
There are reasons for this madness. Consider the loop in the copy
function above. If BorrowedBuf
only knew how much of it was filled, each call to buf.clear()
would lose the information about memory being initialised. In the default implementation of read_buf
it would need to unnecessarily zero the whole buffer. Separately storing information about how much of the buffer has been filled and initialised lets the type avoid double-initialisation of memory.
Alternative model
As an aside, I find modelling BorrowedBuf
as divided into filled and spare capacity, with the spare capacity further divided into initialised and uninitialised, more intuitive. Leaning into the analogy of Vec
is in my opinion more natural and it helps by reinforcing terminology used in existing parts of the language rather than introducing new models.
|                 Capacity                  |
| Filled |          Spare capacity          |
|        | Initialised   |  Uninitialised   |
What people want
Having looked at issues with BorrowedBuf
, let’s consider what people actually want.2 The easiest mental model is that uninitialised memory stores arbitrary data, unknown unless accessed. To achieve such semantics, the uninitialised memory would need to be frozen. A frozen region becomes safe to read and can be accessed through regular Rust references. With freezing operation available, the buffer definition in the copying routine could be turned into something like:
let mut buf = [MaybeUninit::uninit(); 4096];
// SAFETY: u8 has no invalid bit patterns.
let buf = unsafe { MaybeUninit::slice_freeze_mut(&mut buf) };
Or alternatively:
let buf = MaybeUninit::frozen();
// SAFETY: u8 has no invalid bit patterns.
let mut buf: [u8; 4096] = unsafe { buf.assume_init() };
Unsafe blocks are required to account for invalid bit patterns. With a trait like bytemuck::AnyBitPattern, safe versions could exist. Either of those alternatives would require no new methods on the Read trait and would work without any modifications to methods such as rand’s fill_bytes.
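For illustration, such a safe version could bound the element type by bytemuck::AnyBitPattern and push the safety argument into the trait bound. The sketch below is purely hypothetical, since it assumes the proposed slice_freeze_mut operation existed:

use core::mem::MaybeUninit;

// Hypothetical safe freeze, assuming MaybeUninit::slice_freeze_mut existed.
fn freeze_slice<T: bytemuck::AnyBitPattern>(
    slice: &mut [MaybeUninit<T>],
) -> &mut [T] {
    // SAFETY: AnyBitPattern guarantees that every bit pattern is a valid T,
    // so exposing frozen but uninitialised bytes cannot produce an invalid
    // value.
    unsafe { MaybeUninit::slice_freeze_mut(slice) }
}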
Why can’t we have what we want?
Reading uninitialised memory is hardly an issue when analysing things at the hardware level. So long as a memory address is mapped with proper permissions, accessing data from it will always produce some value. There’s no undefined behaviour there.3 In fact, in a typical Linux environment all newly allocated anonymous pages are zero-initialised.4
tautology:
        cmp     BYTE PTR [rdi], 0
        je      tautology_ok
        cmp     BYTE PTR [rdi], 0
        jne     tautology_ok
        mov     al, 0
        ret
tautology_ok:
        mov     al, 1
        ret
The tautology function may fail to return true if the byte it examines lives in a page marked MADV_FREE and the kernel changes the mapping in between the two memory reads.

Unfortunately, even when looking from the point of view of machine code, this analysis isn’t complete…
Giving advice about use of memory
The MADV_FREE flag of the madvise system call allows user space to advise the kernel that (until the next write) it no longer cares about the contents of the specified anonymous pages. This optimisation enables the kernel to discard those pages without swapping them to disk. While the advice is in effect, user space can still access the memory, but it has no guarantee whether it’ll read the old values or zeros. Even code written directly in assembly language, like the tautology function on the right, can result in unexpected behaviour.
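For reference, this is roughly what giving that advice looks like from Rust via the libc crate (a minimal, Linux-specific sketch; MADV_FREE is not available on every platform):

use std::ptr;

fn main() {
    const LEN: usize = 4096;
    unsafe {
        // Map one anonymous, writable page.
        let page = libc::mmap(
            ptr::null_mut(),
            LEN,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1,
            0,
        );
        assert_ne!(page, libc::MAP_FAILED);
        let byte = page as *mut u8;

        byte.write(42);
        // Tell the kernel we no longer care about the page’s contents.
        libc::madvise(page, LEN, libc::MADV_FREE);

        // May print 42 or 0, depending on whether the kernel has reclaimed
        // the page between the advice and this read.
        println!("{}", byte.read());

        libc::munmap(page, LEN);
    }
}

Under Linux the page is only actually reclaimed under memory pressure, so the surprising zero shows up rarely and unpredictably, which is exactly what makes bugs of this kind so hard to track down.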
This isn’t a theoretical concern either. jemalloc, a somewhat popular memory allocator, uses MADV_FREE when memory is freed. As a result, new allocations returned from the allocator may point to a region of memory where the MADV_FREE advice is still in effect. Nicholas Ormrod, in his talk about C++ std::string at Facebook, describes how the interaction between jemalloc, MADV_FREE and reading uninitialised memory resulted in outages.
Page touching
To prevent this issue, the proposed slice_freeze_mut function would need to write into each page of the slice to make sure the kernel notices that the program cares about the contents of those pages again. This could be a simple loop stepping through the buffer 4 KiB at a time, looking something like the following:
pub unsafe fn slice_freeze_mut<T>(
    slice: &mut [MaybeUninit<T>],
) -> &mut [T] {
    const PAGE_SIZE: usize = 4096;
    let ptr = slice.as_mut_ptr() as *mut _;
    let len = slice.len() * size_of::<T>();
    // SAFETY: It’s always safe to split MU object into MU bytes.
    let bytes: &mut [MaybeUninit<u8>] =
        unsafe { core::slice::from_raw_parts_mut(ptr, len) };
    for el in bytes.iter_mut().step_by(PAGE_SIZE) {
        let p = el.as_mut_ptr();
        // SAFETY: Unsafe without language semantics change
        // since we’re reading an uninitialised byte.
        unsafe { p.write_volatile(p.read()) };
    }
    // SAFETY: Caller promises that T has no invalid bit patterns,
    // but this is still unsafe without language semantics change
    // since we haven’t initialised all the bytes.
    unsafe { &mut *(slice as *mut _ as *mut [T]) }
}
Unfortunately, this would hardly be the no-operation that people expect writing into uninitialised memory to be. It would be an improvement over a full initialisation and would address some issues with BorrowedBuf, but it would do so at the cost of unavoidable page touching.
It may seem that the second form — the MaybeUninit::frozen().assume_init() variant — which creates a frozen buffer directly on the stack, could be easier to optimise. The compiler controls the stack and, unless it issues madvise itself, no stack pages will be marked MADV_FREE. Unfortunately, it’s not clear that this always holds true. For example, with async programming the stack lives God-knows-where, and there may be other corner cases that would need to be considered.
Conclusion
I started this article with a promise of some alternatives to BorrowedBuf and yet, as I conclude it, no working alternative has been presented. Indeed, this is perhaps what frustrates me the most about BorrowedBuf. On the face of it, writing data into uninitialised memory is a feature with an obvious solution, but it doesn’t take long before all the obvious solutions clash with Rust’s safety requirements.
So what’s a lowly programmer to do? Donald Knuth is often quoted as stating that ‘premature optimisation is the root of all evil’. True to that adage, in most cases it’s safe to pay the price of memory initialisation. I/O operations usually take orders of magnitude more time, so the time saved by not initialising the memory is often negligible.
But there is more to Knuth’s quote:
We should forget about small efficiencies, say about 97% of the time: premature optimisation is the root of all evil.
Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.5
For the remaining 3%, the options are currently somewhat bleak and depend on the particular code base. They may require switching to the nightly compiler, patching third-party crates, going straight to unsafe syscalls (e.g. read) or isolating critical code paths and writing them in C.
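As an example of the unsafe-syscall route, something along the lines of the sketch below (using the libc crate; not a vetted implementation) reads straight into uninitialised memory without involving the Read trait at all:

use core::mem::MaybeUninit;
use std::os::fd::AsRawFd;

fn read_uninit<'a>(
    file: &std::fs::File,
    buf: &'a mut [MaybeUninit<u8>],
) -> std::io::Result<&'a [u8]> {
    // SAFETY: read(2) only ever writes into the buffer, so handing it
    // uninitialised memory is fine at the syscall boundary.
    let ret = unsafe {
        libc::read(
            file.as_raw_fd(),
            buf.as_mut_ptr() as *mut libc::c_void,
            buf.len(),
        )
    };
    if ret < 0 {
        return Err(std::io::Error::last_os_error());
    }
    // SAFETY: the kernel has initialised the first `ret` bytes.
    Ok(unsafe {
        core::slice::from_raw_parts(buf.as_ptr() as *const u8, ret as usize)
    })
}

The cost is that the portability and convenience of the standard library are gone, which is why this only makes sense for that critical 3%.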
And while we deal with the lack of an ideal solution for writing to uninitialised memory, maybe someone will figure out an alternative fast and ergonomic approach.
Sunday, 12 January 2025
Human error is not the root cause
In 2023 UniSuper, an Australian retirement fund, decided to migrate part of its operations to Google Cloud. As part of the migration, they needed to create virtual machines provisioned with limits higher than what Google’s user interface allowed. To achieve their goals, UniSuper contacted Google support. Having access to internal tools, a Google engineer was able to create the requested instances.
Fast forward to May 2024. UniSuper members lose access to their accounts. The fund blames Google. Some people are sceptical, but eventually UniSuper and Google Cloud publish a joint statement which points at ‘a misconfiguration during provisioning’ as the cause of the outage. Later, a postmortem of the incident sheds even more light on the events which transpired.
It turns out that back in 2023, the Google engineer used a command-line tool to manually create cloud instances according to UniSuper’s requirements. Among its various options, the tool had a switch setting the cloud instance’s term. The engineer omitted it, leading to the instance being created with a fixed term, which triggered automatic deletion a year later.
So, human error. Scold the engineer and case closed. Or is it?
Things are rarely so easy. ‘The outcome knowledge poisons the ability of after-accident observers to recreate the view of practitioners before the accident of those same factors,’ writes Richard Cook. This hindsight bias ‘makes it seem that events leading to the outcome should have appeared more salient to practitioners at the time than was actually the case.’1 Don Norman further observes that even when analysing our own actions, ‘we are apt to blame ourselves.’
But ‘suppose the fault really lies in the device, so that lots of people have the same problems. Because everyone perceives the fault to be their own, nobody wants to admit to having trouble. This creates a conspiracy of silence,’ resulting in the fault never being addressed.2
It’s easy to finish a post-accident analysis by pointing at a human error, especially when there exist numerous safety procedures, devices and fallback mechanisms designed to prevent catastrophic outcomes. However, as James Reason warns us, while ‘this is an understandable reaction, it nonetheless blocks the discovery of effective countermeasures and contributes to further fallible decisions.’3
If what we are concerned with is the efficiency and safety of systems, assigning blame to individuals is counterproductive. Firing the ‘responsible’ party in particular may ironically cause more accidents rather than prevent them. Having experienced a failure, the operator ‘responsible’ for it is better equipped to handle a similar situation in the future. Meanwhile, a new person is prone to making the same error.
Fundamentally, people make mistakes, and if they fear being held accountable they are less likely to admit to errors. This destroys the feedback loop which allows latent failures to be analysed and addressed. An organisation which repairs bad outcomes by disciplining ‘offending’ operators doesn’t end up with operators who make no mistakes; it ends up with operators who are great at hiding their mistakes.
This is why the practice of blameless postmortems (where incidents are analysed without assigning blame) is important. And it is why attributing the root cause to human error is feckless.
Monday, 06 January 2025
Replaying the Microcomputing Revolution
Since microcomputing and computing history are particular topics of interest of mine, I was naturally engaged by a recent article about the Raspberry Pi and its educational ambitions. Perhaps obscured by its subsequent success in numerous realms, the aspirations that originally drove the development of the Pi had their roots in the effects of the introduction of microcomputers in British homes and schools during the 1980s, a phenomenon that supposedly precipitated a golden age of hands-on learning, initiating numerous celebrated and otherwise successful careers in computing and technology.
Such mythology has the tendency to greaten expectations and deepen nostalgia, and when society enters a malaise in one area or another, it often leads to efforts to bring back the magic through new initiatives. Enter the Raspberry Pi! But, as always, we owe it to ourselves to step through the sequence of historical events, as opposed to simply accepting the narratives peddled by those with an agenda or those looking for comforting reminders of their own particular perspectives from an earlier time.
The Raspberry Pi and other products, such as the BBC Micro Bit, associated with relatively recent educational initiatives, were launched with the intention of restoring the focus of learning about computing to that of computing and computation itself. Once upon a time, computers were largely confined to large organisations and particular kinds of endeavour, generally interacting only indirectly with wider society. Thus, for most people, what computers were remained an abstract notion, often coupled with talk of the binary numeral system as the “language” of these mysterious and often uncompromising machines.
However, as microcomputers emerged both in the hobbyist realm – frequently emphasised in microcomputing history coverage – and in commercial environments such as shops and businesses, governments and educators identified a need for “computer literacy”. This entailed practical experience with computers and their applications, informed by suitable educational material, enabling the broader public to understand the limitations and the possibilities of these machines.
Although computers had already been in use for decades, microcomputing diminished the cost of accessible computing systems and thereby dramatically expanded their reach. And when technology is adopted by a much larger group, there is usually a corresponding explosion in applications of that technology as its users make their own discoveries about what the technology might be good for. The limitations of microcomputers relative to their more sophisticated predecessors – mainframes and minicomputers – also meant that existing, well-understood applications were yet to be successfully transferred from those more powerful and capable systems, leaving the door open for nimble, if somewhat less capable, alternatives to be brought to market.
The Capable User
All of these factors pointed towards a strategy where users of computers would not only need to be comfortable interacting with these systems, but where they would also need to have a broad range of skills and expertise, allowing them to go beyond simply using programs that other people had made. Instead, they would need to be empowered to modify existing programs and even write their own. With microcomputers only having a limited amount of memory and often less than convenient storage solutions (cassette tapes being a memorable example), and with few available programs for typically brand new machines, the emphasis of the manufacturer was often on giving the user the tools to write their own software.
Computer literacy efforts sensibly and necessarily went along with such trends, and from the late 1970s and early 1980s, after broader educational programmes seeking to inform the public about microelectronics and computing, these efforts targeted existing models of computer with learning materials like “30 Hour BASIC”. Traditional publishers became involved as the market opportunities grew for producing and selling such materials, and publications like Usborne’s extensive range of computer programming titles were incredibly popular.
Numerous microcomputer manufacturers were founded, some rather more successful and long-lasting than others. An industry was born, around which was a vibrant community – or many vibrant communities – consuming software and hardware for their computers, but crucially also seeking to learn more about their machines and exchanging their knowledge, usually through the specialist print media of the day: magazines, newsletters, bulletins and books. This, then, was that golden age, of computer studies lessons at school, learning BASIC, and of late night coders at home, learning machine code (or, more likely, assembly language) and gradually putting together that game they always wanted to write.
One can certainly question the accuracy of the stereotypical depiction of that era, given that individual perspectives may vary considerably. My own experiences involved limited exposure to educational software at primary school, and the anticipated computer studies classes at secondary school never materialising. What is largely beyond dispute is that after the exciting early years of microcomputing, the educational curriculum changed focus from learning about computers to using them to run whichever applications happened to be popular or attractive to potential employers.
The Vocational Era
Thus, microcomputers became mere tools to do other work, and in that visionless era of Thatcherism, such other work was always likely to be clerical: writing letters and doing calculations in simple spreadsheets, sowing the seeds of dysfunction and setting public expectations of information systems correspondingly low. “Computer studies” became “information technology” in the curriculum, usually involving systems feigning a level of compatibility with the emerging IBM PC “standard”. Naturally, better-off schools will have had nicer equipment, perhaps for audio and video recording and digitising, plus the accompanying multimedia authoring tools, along with a somewhat more engaging curriculum.
At some point, the Internet will have reached schools, bringing e-mail and Web access (with all the complications that entails), and introducing another range of practical topics. Web authoring and Web site development may, if pursued to a significant extent, reveal such things as scripts and services, but one must then wonder what someone encountering the languages involved for the first time might be able to make of them. A generation or two may have grown up seeing computers doing things but with no real exposure to how the magic was done.
And then, there is the matter of how receptive someone who is largely unexposed to programming might be to more involved computing topics, lower-level languages, data structures and algorithms, of the workings of the machine itself. The mythology would have us believe that capable software developers needed the kind of broad exposure provided by the raw, unfiltered microcomputing experience of the 1980s to be truly comfortable and supremely effective at any level of a computing system, having sniffed out every last trick from their favourite microcomputer back in the day.
Those whose careers were built in those early years of microcomputing may now be seeing their retirement approaching, at least if they have not already made their millions and transitioned into some kind of role advising the next generation of similarly minded entrepreneurs. They may lament the scarcity of local companies in the technology sector, look at their formative years, and conclude that the system just doesn’t make them like they used to.
(Never mind that the system never made them like that in the first place: all those game-writing kids who may or may not have gone on to become capable, professional developers were clearly ignoring all the high-minded educational stuff that other people wanted them to study. Chess computers and robot mice immediately spring to mind.)
A Topic for Another Time
What we probably need to establish, then, is whether such views truly incorporate the wealth of experience present in society, or whether they merely reflect a narrow perspective where the obvious explanation may apply to some people’s experience but fails to explain the entire phenomenon. Here, we could examine teaching at a higher educational level than the compulsory school system, particularly because academic institutions were already performing and teaching computing for decades before controversies about the school computing curriculum arose.
We might contrast the casual, self-taught, experimental approach to learning about programming and computers with the structured approach favoured in universities, of starting out with high-level languages, logic, mathematics, and of learning about how the big systems achieved their goals. I encountered people during my studies who had clearly enjoyed their formative experiences with microcomputers becoming impatient with the course of these studies, presumably wondering what value it provided to them.
Some of them quit after maybe only a year, whereas others gained an ordinary degree as opposed to graduating with honours, but hopefully they all went on to lucrative and successful careers, unconstrained and uncurtailed by their choice. But I feel that I might have missed some useful insights and experiences had I done the same. But for now, let us go along with the idea that constructive exposure to technology throughout the formative education of the average person enhances their understanding of that technology, leading to a more sophisticated and creative population.
A Complete Experience
Backtracking to the article that started this article off, we then encounter one educational ambition that has seemingly remained unaddressed by the Raspberry Pi. In microcomputing’s golden age, the motivated learner was ostensibly confronted with the full power of the machine from the point of switching on. They could supposedly study the lowest levels and interact with them using their own software, comfortable with their newly acquired knowledge of how the hardware works.
Disregarding the weird firmware situation with the Pi, it may be said that most Pi users will not be in quite the same position when running the Linux-based distribution deployed on most units as someone back in the 1980s with their BBC Micro, one of the inspirations for the Pi. This is actually a consequence of how something even cheaper than a microcomputer of an earlier era has gained sophistication to such an extent that it is architecturally one of those “big systems” that stuffy university courses covered.
In one regard, the difference in nature between the microcomputers that supposedly conferred developer prowess on a previous generation and the computers that became widespread subsequently, including single-board computers like the Pi, undermines the convenient narrative that microcomputers gave the earlier generation their perfect start. Systems built on processors like the 6502 and the Z80 did not have different privilege levels or memory management capabilities, leaving their users blissfully unaware of such concepts, even if the curious will have investigated the possibilities of interrupt handling and been exposed to any related processor modes, or even if some kind of bank switching or simple memory paging had been used by some machines.
Indeed, topics relevant to microcomputers from the second half of the 1980s are surprisingly absent from retrocomputing initiatives promoting themselves as educational aids. While the Commander X16 is mostly aimed at those seeking a modern equivalent of their own microcomputer learning environment, and many of its users may also end up mostly playing games, the Agon Light and related products are more aggressively pitched as being educational in nature. And yet, these projects cling to 8-bit processors, some inviting categorisation as being more like microcontrollers than microprocessors, as if the constraints of those processor architectures conferred simplicity. In fact, moving up from the 6502 to the 68000 or ARM made life easier in many ways for the learner.
When pitching a retrocomputing product at an audience with the intention of educating them about computing, also adding some glamour and period accuracy to the exercise, it would arguably be better to start with something from the mid-1980s like the Atari ST, providing a more scalable processor architecture and sensible instruction set, but also coupling the processor with memory management hardware. The Atari ST and Commodore Amiga didn’t have a memory management unit in their earliest models, only introducing one later to attempt a move upmarket.
Certainly, primary school children might not need to learn the details of all of this power – just learning programming would be sufficient for them – but as they progress into the later stages of their education, it would be handy to give them new challenges and goals, to understand how a system works where each program has its own resources and cannot readily interfere with other programs. Indeed, something with a RISC processor and memory management capabilities would be just as credible.
How “authentic” a product with a RISC processor and “big machine” capabilities would be, in terms of nostalgia and following on from earlier generations of products, might depend on how strict one decides to be about the whole exercise. But there is nothing inauthentic about a product with such a feature set. In fact, one came along as the de-facto successor to the BBC Micro, and yet relatively little attention seems to be given to how it addressed some of the issues faced by the likes of the Pi.
Under The Hood
In assessing the extent of the Pi’s educational scope, the aforementioned article has this to say:
“Encouraging naive users to go under the hood is always going to be a bad idea on systems with other jobs to do.”
For most people, the Pi is indeed running many jobs and performing many tasks, just as any Linux system might do. And as with any “big machine”, the user is typically and deliberately forbidden from going “under the hood” and interfering with the normal functioning of the system, even if a Pi is only hosting a single user, unlike the big systems of the past with their obligations to provide a service to many users.
Of course, for most purposes, such a system has traditionally been more than adequate for people to learn about programming. But traditionally, low-level systems programming and going under the hood generally meant downtime, which on expensive systems was largely discouraged, confined to inconvenient times of day, and potentially undertaken at one’s peril. Things have changed somewhat since the old days, however, and we will return to that shortly. But satisfying the expectations of those wanting a responsive but powerful learning environment was a challenge encountered even as the 1980s played out.
With early 1980s microcomputers like the BBC Micro, several traits comprised the desirable package that people now seek to reproduce. The immediacy of such systems allowed users to switch on and interact with the computer in only a few seconds, as opposed to a lengthy boot sequence that possibly also involved inserting disks, never mind the experiences of the batch computing era that earlier computing students encountered. Such interactivity lent such systems a degree of transparency, letting the user interact with the system and rapidly see the effects. Interactions were not necessarily constrained to certain facets of the system, allowing users to engage with the mechanisms “under the hood” with both positive and negative effects.
The Machine Operating System (MOS) of the BBC Micro and related machines such as the Acorn Electron and BBC Master series, provided well-defined interfaces to extend the operating system, introduce event or interrupt handlers, to deliver utilities in the form of commands, and to deliver languages and applications. Such capabilities allowed users to explore the provided functionality and the framework within which it operated. Users could also ignore the operating system’s facilities and more or less take full control of the machine, slipping out of one set of imposed constraints only to be bound by another, potentially more onerous set of constraints.
Earlier Experiences
Much is made of the educational impact of systems like the BBC Micro by those wishing to recapture some of the magic on more capable systems, but relatively few people seem to be curious about how such matters were tackled by the successor to the BBC Micro and BBC Master ranges: Acorn’s Archimedes series. As a step away from earlier machines, the Archimedes offers an insight into how simplicity and immediacy can still be accommodated on more powerful systems, through native support for familiar technology such as BASIC, compatibility layers for old applications, and system emulators for those who need to exercise some of the new hardware in precisely the way that worked on the older hardware.
When the Archimedes was delivered, the original Arthur operating system largely provided the recognisable BBC Micro experience. Starting up showed a familiar welcome message, and even if it may have dropped the user at a “supervisor” prompt as opposed to BASIC, something which did also happen occasionally on earlier machines, typing “BASIC” got the user the rest of the way to the environment they had come to expect. This conferred the ability to write programs exercising the graphical and audio capabilities of the machine to a substantial degree, including access to assembly language, albeit of a different and rather superior kind to that of the earlier machines. Even writing directly to screen memory worked, albeit at a different location and with a more sensible layout.
Under Arthur, users could write programs largely as before, with differences attributable to the change in capabilities provided by the new machines. Even though errant pokes to exotic memory locations might have been trapped and handled by the system’s enhanced architecture, it was still possible to write software that ran in a privileged mode, installed interrupt handlers, and produced clever results, at the risk of freezing or crashing the system. When Arthur was superseded by RISC OS, the desktop interface became the default experience, hiding the immediacy and the power of the command prompt and BASIC, but such facilities remained only a keypress away and could be configured as the default with perhaps only a single command.
RISC OS exposed the tensions between the need for a more usable and generally accessible interface, potentially doing many things at once, and the desire to be able to get under the hood and poke around. It was possible to write desktop applications in BASIC, but this was not really done in a particularly interactive way, and programs needed to make system calls to interact with the rest of the desktop environment, even though the contents of windows were painted using the classic BASIC graphics primitives otherwise available to programs outside the desktop. Desktop programs were also expected to cooperate properly with each other, potentially hanging the system if not written correctly.

The Maestro music player in RISC OS, written in BASIC. Note that the !RunImage file is a BASIC program, with the somewhat compacted code shown in the text editor.
A safer option for those wanting the classic experience and to leverage their hard-earned knowledge, was to forget about the desktop and most of the newer capabilities of the Archimedes and to enter the BBC Micro emulator, 65Host, available on one of the supplied application disks, writing software just as before, and then running that software or any other legacy software of choice. Apart from providing file storage to the emulator and bearing all the work of the emulator itself, this did not really exercise the newer machine, but it still provided a largely authentic, traditional experience. One could presumably crash the emulated machine, but this should merely have terminated the emulator.
An intermediate form of legacy application support was also provided. 65Tube, with “Tube” referencing an interfacing paradigm used by the BBC Micro, allowed applications written against documented interfaces to run under emulation but accessing facilities in the native environment. This mostly accommodated things like programming language environments and productivity applications and might have seemed superfluous alongside the provision of a more comprehensive emulator, but it potentially allowed such applications to access capabilities that were not provided on earlier systems, such as display modes with greater resolutions and more colours, or more advanced filesystems of different kinds. Importantly, from an educational perspective, these emulators offered experiences that could be translated to the native environment.

65Tube running in MODE 15, utilising many more colours than normally available on earlier Acorn machines.
Although the Archimedes drifted away from the apparent simplicity of the BBC Micro and related machines, most users did not fully understand the software stack on such earlier systems, anyway. However, despite the apparent sophistication of the BBC Micro’s successors, various aspects of the software architecture were, in fact, preserved. Even the graphical user interface on the Archimedes was built upon many familiar concepts and abstractions. The difficulty for users moving up to the newer system arose upon finding that much of their programming expertise and effort had to be channelled into a software framework that confined the activities of their code, particularly in the desktop environment. One kind of framework for more advanced programs had merely been replaced by others.
Finding Lessons for Today
The way the Archimedes attempted to accommodate the expectations cultivated by earlier machines does not necessarily offer a convenient recipe to follow today. However, the solutions it offered should draw our attention to some other considerations. One is the level of safety in the environment being offered: it should be possible to interact with the system without bringing it down or causing havoc.
In that respect, the Archimedes provided a sandboxed environment like an emulator, but this was only really viable for running old software, as indeed was the intention. It also did not multitask, although other emulators eventually did. The more integrated 65Tube emulator also did not multitask, although later enhancements to RISC OS such as task windows did allow it to multitask to a degree.

65Tube running in a task window. This relies on the text editing application and unfortunately does not support fancy output.
Otherwise, the native environment offered all the familiar tools and the desired level of power, but along with them plenty of risks for mayhem. Thus, a choice between safety and concurrency was forced upon the user. (Aside from Arthur and RISC OS, there was also Acorn’s own Unix port, RISC iX, which had similar characteristics to the kind of Linux-based operating system typically run on the Pi. You could, in principle, run a BBC Micro emulator under RISC iX, just as people run emulators on the Pi today.)
Today, we could actually settle for the same software stack on some Raspberry Pi models, with all its advantages and disadvantages, by running an updated version of RISC OS on such hardware. The bundled emulator support might be missing, however; for those wanting to go under the hood and also take advantage of the hardware, it is unlikely that they would be so interested in replicating the original BBC Micro experience with perfect accuracy, instead merely seeking to replicate the same kind of experience.
Another consideration the Archimedes raises is the extent to which an environment may take advantage of the host system, and it is this consideration that potentially has the most to offer in formulating modern solutions. We may normally be completely happy running a programming tool in our familiar computing environments, where graphical output, for example, may be confined to a window or occasionally shown in full-screen mode. Indeed, something like a Raspberry Pi need not have any rigid notion of what its “native” graphical capabilities are, and the way a framebuffer is transferred to an actual display is normally not of any real interest.
The learning and practice of high-level programming can be adequately performed in such a modern environment, with the user safely confined by the operating system and mostly unable to bring the system down. However, it might not adequately expose the user to those low-level “under the hood” concepts that they seem to be missing out on. For example, we may wish to introduce the framebuffer transfer mechanism as some kind of educational exercise, letting the user appreciate how the text and graphics plotting facilities they use lead to pixels appearing on their screen. On the BBC Micro, this would have involved learning about how the MOS configures the 6845 display controller and the video ULA to produce a usable display.
The configuration of such a mechanism typically resides at a fairly low level in the software stack, out of the direct reach of the user, but allowing a user to reconfigure such a mechanism would risk introducing disruption to the normal functioning of the system. Therefore, a way is needed to either expose the mechanism safely or to simulate it. Here, technology’s steady progression does provide some possibilities that were either inconvenient or impossible on an early ARM system like the Archimedes, notably virtualisation support, allowing us to effectively run a simulation of the hardware efficiently on the hardware itself.
Thus, we might develop our own framebuffer driver and fire up a virtual machine running our operating system of choice, deploying the driver and assessing the consequences provided by a simulation of that aspect of the hardware. Of course, this would require support in the virtual environment for that emulated element of the hardware. Alternatively, we might allow some kind of restrictive access to that part of the hardware, risking the failure of the graphical interface if misconfiguration occurred, but hopefully providing some kind of fallback control mechanism, like a serial console or remote login, to restore that interface and allow the errant code to be refined.
A less low-level component that might invite experimentation could be a filesystem. The MOS in the BBC Micro and related machines provided filesystem (or filing system) support in the form of service ROMs, and in RISC OS on the Archimedes such support resides in the conceptually similar relocatable modules. Given the ability of normal users to load such modules, it was entirely possible for a skilled user to develop and deploy their own filesystem support, with the associated risks of bringing down the system. Linux does have arguably “glued-on” support for unprivileged filesystem deployment, but there might be other components in the system worthy of modification or replacement, and thus the virtual machine might need to come into play again to allow the desired degree of experimentation.
A Framework for Experimentation
One can, however, envisage a configurable software system where a user session might involve a number of components providing the features and services of interest, and where a session might be configured to exclude or include certain typical or useful components, to replace others, and to allow users to deploy their own components in a safe fashion. Alongside such activities, a normal system could be running, providing access to modern conveniences at a keypress or the touch of a button.
We might want the flexibility to offer something resembling 65Host, albeit without the emulation of an older system and its instruction set, for a highly constrained learning environment where many aspects of the system can be changed for better or worse. Or we might want something closer to 65Tube, again without the emulation, acting mostly as a “native” program but permitting experimentation on a few elements of the experience. An entire continuum of possibilities could be supported by a configurable framework, allowing users to progress from a comfortable environment with all of the expected modern conveniences, gradually seeing each element removed and then replaced with their own implementation, until arriving in an environment where they have the responsibility at almost every level of the system.
In principle, a modern system aiming to provide an “under the hood” experience merely needs to simulate that experience. As long as the user experiences the same general effects from their interactions, the environment providing the experience can still isolate a user session from the underlying system and avoid unfortunate consequences from that misbehaving session. Purists might claim that as long as any kind of simulation is involved, the user is not actually touching the hardware and is therefore not engaging in low-level development, even if the code they are writing would be exactly the code that would be deployed on the hardware.
Systems programming can always be done by just writing programs and deploying them on the hardware or in a virtual machine to see if they work, resetting the system and correcting any mistakes, which is probably how most programming of this kind is done even today. However, a suitably configurable system would allow a user to iteratively and progressively deploy a customised system, and to work towards deploying a complete system of their own. With the final pieces in place, the user really would be exercising the hardware directly, finally silencing the purists.
Naturally, given my interest in microkernel-based systems, the above concept would probably rest on the use of a microkernel, with much more of a blank canvas available to define the kind of system we might like, as opposed to more prescriptive systems with monolithic kernels and much more of the basic functionality squirrelled away in privileged kernel code. Perhaps the only difficult elements of a system to open up to user modification, those that cannot also be easily delegated or modelled by unprivileged components, would be those few elements confined to the microkernel and performing fundamental operations such as directly handling interrupts, switching execution contexts (threads), writing memory mappings to the appropriate registers, and handling system calls and interprocess communications.
Even so, many aspects of these low-level activities are exposed to user-level components in microkernel-based operating systems, leaving few mysteries remaining. For those advanced enough to progress to kernel development, traditional systems programming practices would surely be applicable. But long before that point, motivated learners will have had plenty of opportunities to get “under the hood” and to acquire a reasonable understanding of how their systems work.
A Conclusion of Sorts
As for why people are not widely using the Raspberry Pi to explore low-level computing, the challenge of facilitating such exploration when the system has “other jobs to do” certainly seems like a reasonable excuse, especially given the choice of operating system deployed on most Pi devices. One could remove those “other jobs” and run RISC OS, of course, putting the learner in an unfamiliar and more challenging environment, perhaps giving them another computer to use at the same time to look things up on the Internet. Or one could adopt a different software architecture, but that would involve an investment in software that few organisations can be bothered to make.
I don’t know whether the University of Cambridge has seen better-educated applicants in recent years as a result of Pi proliferation, or whether today’s applicants are as similarly perplexed by low-level concepts as those from the pre-Pi era. But then, there might be a lesson to be learned about applying some rigour to technological interventions in society. After all, there were some who justifiably questioned the effectiveness of rolling out microcomputers in schools, particularly when teachers have never really been supported in their work, as more and more is asked of them by their political overlords. Investment in people and their well-being is another thing that few organisations can be bothered to make, too.
Sunday, 29 December 2024
Beware of Composable Foundation
So far I have been lucky in my professional life. I have never had any conflicts with my employers and for the most part maintained good rapport with coworkers and managers alike. Alas, my luck has finally run out.
Long story short, I left Composable Foundation in October and I am still waiting for my final paycheck. TL;DR: If you are doing business with them make sure you are paid in advance.
The issues started around June when my invoices stopped getting paid. I did not press the issue immediately and only started asking about it in my last month. I had been reassured that all remaining payments would be made on the last day of my contract.
They were not.
I started nagging people some more about the payments and kept getting reassured that the transfers were coming. While money had been trickling in, the deadlines and amounts kept shifting. Was that some kind of delay tactic? If so, I do not understand what the delay could possibly achieve.
Omar Zaki, Composable’s CEO, who is better known online as 0xbrainjar, claimed that ‘he’ll get [the final transfer] done by Friday’. When that did not happen, he claimed that ‘it’ll be done by the time [I] wake up Saturday.’ And yes, you have guessed it, that did not happen either. I waited till Saturday noon New York time to message him about it, and only then did a transfer happen. Alas, rather than the full amount, it was less than half of the invoice value.
At this point I gave up. If the remaining amount ever arrives I will update this post, but I consider the money lost.
The saddest part is that I am working with the University of Lisbon on a handful of research projects which started as a collaboration between Composable and the university. With my recent experience with Composable, I worry about the future of that research. Then again, I have single-handedly kept that collaboration going for the last few months, so hopefully we can finish what we have started even without Composable’s involvement.
Thursday, 19 December 2024
Atlantis Azure Devops check PR approvals
Atlantis is a Terraform pull request automation platform: pretty much everybody in your organization can modify Terraform code and run plan and apply, which introduces some security/authorization problems that must be properly addressed.
I’ve created a shell script that connects to Azure DevOps and checks whether the PR has been approved by a member of one or more groups, so you can require that a PR be approved by the devops/infrastructure team before the code is executed in the plan or apply phase.
The script is meant to be invoked by Atlantis during a custom workflow; it will reject a PR that has not been approved by a member of one or more specific groups.
This can be useful to everyone who is using Atlantis with Azure DevOps, so I’ve released it on GitHub: https://github.com/davidegiunchi/atlantis-azdevops-check-pr-approvals
Saturday, 14 December 2024
Dual Screen CI20
Following on from yesterday’s post, where a small display was driven over SPI from the MIPS Creator CI20, it made sense to exercise the HDMI output again. With a few small fixes to the configuration files, demonstrating that the HDMI output still worked, I suppose one thing just had to be done: to drive both displays at the same time.

The MIPS Creator CI20 driving an SPI display and a monitor via HDMI.
Thus, two separate instances of the spectrum example, each utilising their own framebuffer, potentially multiplexed with other programs (but not actually done here), are displayed on their own screen. All it required was a configuration that started all the right programs and wired them up.
Again, we may contemplate what the CI20 was probably supposed to be: some kind of set-top box providing access to media files stored on memory cards or flash memory, possibly even downloaded from the Internet. On such a device, developed further into a product, there might well have been a front panel display indicating the status of the device, the current media file details, or just something as simple as the time and date.
Here, an LCD is used and not in any sensible orientation for use in such a product, either. We would want to use some kind of right-angle connector to make it face towards the viewer. Once upon a time, vacuum fluorescent displays were common for such applications, but I could imagine a simple, backlit, low-resolution monochrome LCD being an alternative now, maybe with RGB backlighting to suit the user’s preferences.
Then again, for prototyping, a bright LCD like this, decadent though it may seem, somehow manages to be cheaper than much simpler backlit, character matrix displays. And I also wonder how many people ever attached two displays to their CI20.
Friday, 13 December 2024
Testing Newer Work on Older Boards
Since I’ve been doing some housekeeping in my low-level development efforts, I had to get the MIPS Creator CI20 out and make sure I hadn’t broken too much, also checking that the newer enhancements could be readily ported to the CI20’s pinout and peripherals. It turns out that the Pimoroni Pirate Audio speaker board works just fine on the primary expansion header, at least to use the screen, and doesn’t need the backlight pin connected, either.

The Pirate Audio speaker hat on the MIPS Creator CI20.
Of course, the CI20 was designed to be pinout-compatible with the original Raspberry Pi, which had a 26-pin expansion header. This was replaced by a 40-pin header in subsequent Raspberry Pi models, presumably wrongfooting various suppliers of accessories, but the real difficulties will have been experienced by those with these older boards, needing to worry about whether newer, 40-pin “hat” accessories could be adapted.
To access the Pirate Audio hat’s audio support, some additional wiring would, in principle, be necessary, but the CI20 doesn’t expose I2S functionality via its headers. (The CI20 has a more ambitious audio architecture involving a codec built into the JZ4780 SoC and a wireless chip capable of Bluetooth audio, not that I’ve ever exercised this even under Linux.) So, this demonstration is about as far as we can sensibly get with the CI20. I also tested the Waveshare panel and it seemed to work, too. More testing remains, of course!
Thursday, 05 December 2024
A Small Update
Following swiftly on from my last article, I decided to take the opportunity to extend my framebuffer components to support an interface utilised by the L4Re framework’s Mag component, which is a display multiplexer providing a kind of multiple window environment. I’m not sure if Mag is really supported any more, but it provided the basis of a number of L4Re examples for a while, and I brought it into use for my own demonstrations.
Eventually, having needed to remind myself of some of the details of my own software, I managed to deploy the collection of components required, each with their own specialised task, but most pertinently a SoC-specific SPI driver and a newly extended display-specific framebuffer driver. The framebuffer driver could now be connected directly to Mag in the Lua-based coordination script used by the Ned initialisation program, which starts up programs within L4Re, and Mag could now request a region of memory from the framebuffer driver for further use by other programs.
All of this extra effort merely provided another way of delivering a familiar demonstration, that being the colourful, mesmerising spectrum example once provided as part of the L4Re software distribution. This example also uses the programming interface mentioned above to request a framebuffer from Mag. It then plots its colourful output into this framebuffer.
The result is familiar from earlier articles:

The spectrum example on a screen driven by the ILI9486 controller.
The significant difference, however, is that underneath the application programs, a combination of interchangeable components provides the necessary adaptation to the combination of hardware devices involved. And the framebuffer component can now completely replace the fb-drv component that was also part of the L4Re distribution, thereby eliminating a dependency on a rather cumbersome and presumably obsolete piece of software.
Monday, 02 December 2024
Recent Progress
The last few months have not always been entirely conducive to making significant progress with various projects, particularly my ongoing investigations and experiments with L4Re, but I did manage to reacquaint myself with my previous efforts sufficiently to finally make some headway in November. This article tries to retrieve some of the more significant accomplishments, modest as they might be, to give an impression of how such work is undertaken.
Previously, I had managed to get my software to do somewhat useful things on MIPS-based single-board computer hardware, showing graphical content on a small screen. Various problems had arisen with regard to one revision of a single-board computer for which the screen was originally intended, causing me to shift my focus to more general system functionality within L4Re. With the arrival of the next revision of the board, I leveraged this general functionality, combining it with support for memory cards, to get my minimalist system to operate on the board itself. I rather surprised myself getting this working, it must be said.
Returning to the activity at the start of November, there were still some matters to be resolved. In parallel to my efforts with L4Re, I had been trying to troubleshoot the board’s operation under Linux. Linux is, in general, a topic upon which I do not wish to waste my words. However, with the newer board revision, I had also acquired another, larger, screen and had been investigating its operation, and there were performance-related issues experienced under Linux that needed to be verified under other conditions. This is where a separate software environment can be very useful.
Plugging a Leak
Before turning my attention to the larger screen, I had been running a form of stress test with the smaller screen, updating it intensively while also performing read operations from the memory card. What this demonstrated was that there were no obvious bandwidth issues with regard to data transfers occurring concurrently. Translating this discovery back to Linux remains an ongoing exercise, unfortunately. But another problem arose within my own software environment: after a while, the filesystem server would run out of memory. I felt that this problem now needed to be confronted.
Since I tend to make such problems for myself, I suspected a memory leak in some of my code, despite trying to be methodical in the way that allocated objects are handled. I considered various tools that might localise this particular leak, with AddressSanitizer and LeakSanitizer being potentially useful, merely requiring recompilation and being available for a wide selection of architectures as part of GCC. I also sought to demonstrate the problem in a virtual environment, this simply involving appropriate test programs running under QEMU. Unfortunately, the sanitizer functionality could not be linked into my binaries, at least with the Debian toolchains that I am using.
Eventually, I resolved to use simpler techniques. Wondering if the memory allocator might be fragmenting memory, I introduced a call to malloc_stats, just to get an impression of the state of the heap. After failing to gain much insight into the problem, I rolled up my sleeves and decided to just look through my code for anything I might have done with regard to allocating memory, just to see if I had overlooked anything as I sought to assemble a working system from its numerous pieces.
Sure enough, I had introduced an allocation for “convenience” in one kind of object, making a pool of memory available to that object if no specific pool had been presented to it. The memory pool itself would release its own memory upon disposal, but in focusing on getting everything working, I had neglected to introduce the corresponding top-level disposal operation. With this remedied, my stress test was now able to run seemingly indefinitely.
Separating Displays and Devices
I would return to my generic system support later, but the need to exercise the larger screen led me to consider the way I had previously introduced support for screens and displays. The smaller screen employs SPI as the communications mechanism between the SoC and the display controller, as does the larger screen, and I had implemented support for the smaller screen as a library combining the necessary initialisation and pixel data transfer code with code that would directly access the SPI peripheral using a SoC-specific library.
Clearly, this functionality needed to be separated into two distinct parts: the code retaining the details of initialising and operating the display via its controller, and the code performing the SPI communication for a specific SoC. Not doing this could require us to needlessly build multiple variants of the display driver for different SoCs or platforms, when in principle we should only need one display driver with knowledge of the controller and its peculiarities, this then being combined using interprocess communication with a single, SoC-specific driver for the communications.
A few years ago now, I had in fact implemented a “server” in L4Re to perform short SPI transfers on the Ben NanoNote, this to control the display backlight. It became appropriate to enhance this functionality to allow programs to make longer transfers using data held in shared memory, all of this occurring without those programs having privileged access to the underlying SPI peripheral in the SoC. Alongside the SPI server appropriate for the Ben NanoNote’s SoC, servers would be built for other SoCs, and only the appropriate one would be started on a given hardware device. This would then mediate access to the SPI peripheral, accepting requests from client programs within the established L4Re software architecture.
One important element in the enhanced SPI server functionality is the provision of shared memory that can be used for DMA transfers. Fortunately, this is mostly a matter of using the appropriate settings when requesting memory within L4Re, even though the mechanism has been made somewhat more complicated in recent times. It was also fortunate that I previously needed to consider such matters when implementing memory card support, saving me time in considering them now. The result is that a client program should be able to write into a memory region and the SPI server should be able to send the written data directly to the display controller without any need for additional copying.
Complementing the enhanced SPI servers are framebuffer components that use these servers to configure each kind of display, each providing an interface to their own client programs which, in turn, access the display and provide visual content. The smaller screen uses an ST7789 controller and is therefore supported by one kind of framebuffer component, whereas the larger screen uses an ILI9486 controller and has its own kind of component. In principle, the display controller support could be organised so that common code is reused and that support for additional controllers would only need specialisations to that generic code. Both of these controllers seem to implement the MIPI DBI specifications.
The particular display board housing the larger screen presented some additional difficulties, being very peculiarly designed to present what would seem to be an SPI interface to the hardware interfacing to the board, but where the ILI9486 controller’s parallel interface is apparently used on the board itself, with some shift registers and logic faking the serial interface to the outside world. This complicates the communications, requiring 16-bit values to be sent where 8-bit values would be used in genuine SPI command traffic.
The motivation for this weird design is presumably that of squeezing a little extra performance out of the controller that is only available when transferring pixel data via the parallel interface, especially desired by those making low-cost retrogaming systems with the Raspberry Pi. Various additional tweaks were needed to make the ILI9486 happy, such as an explicit reset pulse, with this being incorporated into my simplistic display component framework. Much more work is required in this area, and I hope to contemplate such matters in the not-too-distant future.
Discoveries and Remedies
Further testing brought some other issues to the fore. With one of the single-board computers, I had been using a microSD card with a capacity of about half a gigabyte, which would make it a traditional SD or SDSC (standard capacity) card, at least according to the broader SD card specifications. With another board, I had been using a card with a sixteen gigabyte capacity or thereabouts, aligning it with the SDHC (high capacity) format.
Starting to exercise my code a bit more on this larger card exposed memory mapping issues when accessing the card as a single region: on the 32-bit MIPS architecture used by the SoC, a pointer simply cannot address this entire region, and thus some pointer arithmetic occurred that had undesirable consequences. Constraining the size of mapped regions seemed like the easiest way of fixing this problem, at least for now.
More sustained testing revealed a couple of concurrency issues. One involved a path of invocation via a method testing for access to filesystem objects where I had overlooked that the method, deliberately omitting usage of a mutex, could be called from another component and thus circumvent the concurrency measures already in place. I may well have refactored components at some point, forgetting about this particular possibility.
Another issue was an oversight in the way an object providing access to file content releases its memory pages for other objects to use before terminating, part of the demand paging framework that has been developed. I had managed to overlook a window between two operations where an object seeking to acquire a page from the terminating object might obtain exclusive access to a page, but upon attempting to notify the terminating object, find that it has since been deallocated. This caused memory access errors.
Strangely, I had previously noticed one side of this potential situation in the terminating object, even writing up some commentary in the code, but I had failed to consider the other side of it lurking between those two operations. Building in the missing support involved getting the terminating object to wait for its counterparts, so that they may notify it about pages they were in the process of removing from its control. Hopefully, this resolves the problem, but perhaps the lesson is that if something anomalous is occurring, exhibiting certain unexpected effects, the cause should not be ignored or assumed to be harmless.
All of this proves to be quite demanding work, having to consider many aspects of a system at a variety of levels and across a breadth of components. Nevertheless, modest progress continues to be made, even if it is entirely on my own initiative. Hopefully, it remains of interest to a few of my readers, too.
Wednesday, 27 November 2024
Creating a kubernetes cluster with kubeadm on Ubuntu 24.04 LTS
(this is a copy of my git repo of this post)
https://github.com/ebal/k8s_cluster/
Kubernetes, also known as k8s, is an open-source system for automating deployment, scaling, and management of containerized applications.
Notice: the initial (old) blog post for Ubuntu 22.04 is (still) here: blog post
- Prerequisites
- Git Terraform Code for the kubernetes cluster
- Control-Plane Node
- Ports on the control-plane node
- Firewall on the control-plane node
- Hosts file in the control-plane node
- Updating your hosts file
- No Swap on the control-plane node
- Kernel modules on the control-plane node
- NeedRestart on the control-plane node
- temporarily
- permanently
- Installing a Container Runtime on the control-plane node
- Installing kubeadm, kubelet and kubectl on the control-plane node
- Get kubernetes admin configuration images
- Initializing the control-plane node
- Create user access config to the k8s control-plane node
- Verify the control-plane node
- Install an overlay network provider on the control-plane node
- Verify CoreDNS is running on the control-plane node
- Worker Nodes
- Get Token from the control-plane node
- Is the kubernetes cluster running ?
- Kubernetes Dashboard
- Helm
- Install kubernetes dashboard
- Accessing Dashboard via a NodePort
- Patch kubernetes-dashboard
- Edit kubernetes-dashboard Service
- Accessing Kubernetes Dashboard
- Create An Authentication Token (RBAC)
- Creating a Service Account
- Creating a ClusterRoleBinding
- Getting a Bearer Token
- Browsing Kubernetes Dashboard
- Nginx App
- That’s it
In this blog post, I’ll share my personal notes on setting up a kubernetes cluster using kubeadm on Ubuntu 24.04 LTS Virtual Machines.
For this setup, I will use three (3) Virtual Machines in my local lab. My home lab is built on libvirt with QEMU/KVM (Kernel-based Virtual Machine), and I use Terraform as the infrastructure provisioning tool.
Prerequisites
- at least 3 Virtual Machines of Ubuntu 24.04 (one for control-plane, two for worker nodes)
- 2GB (or more) of RAM on each Virtual Machine
- 2 CPUs (or more) on each Virtual Machine
- 20GB of hard disk on each Virtual Machine
- No SWAP partition/image/file on each Virtual Machine
Streamline the lab environment
To simplify the Terraform code for the libvirt/QEMU Kubernetes lab, I’ve made a few adjustments so that all of the VMs use the below default values:
- ssh port: 22/TCP
- volume size: 40G
- memory: 4096
- cpu: 4
Review the values and adjust them according to your requirements and limitations.
Git Terraform Code for the kubernetes cluster
I prefer maintaining a reproducible infrastructure so that I can quickly create and destroy my test lab. My approach involves testing each step, so I often destroy everything, copy and paste commands, and move forward. I use Terraform to provision the infrastructure. You can find the full Terraform code for the Kubernetes cluster here: k8s cluster - Terraform code.
If you do not use terraform, skip this step!
You can git clone
the repo to review and edit it according to your needs.
git clone https://github.com/ebal/k8s_cluster.git
cd tf_libvirt
You will need to make appropriate changes. Open Variables.tf for that. The most important option to change is the User option: set it to your GitHub username and the VMs will be set up with your public key instead of mine!
Pretty much everything else should work out of the box. Adjust the vmem and vcpu settings to your needs.
Initialize the working directory
Initialize terraform before running the shell script below.
This will download all the required terraform providers and modules into your local directory.
terraform init
Ubuntu 24.04 Image
Before proceeding with creating the VMs, we need to ensure that the Ubuntu 24.04 image is available on our system, or modify the code to download it from the internet.
In Variables.tf terraform file, you will notice the below entries
# The image source of the VM
# cloud_image = "https://cloud-images.ubuntu.com/oracular/current/focal-server-cloudimg-amd64.img"
cloud_image = "../oracular-server-cloudimg-amd64.img"
If you do not want to download the Ubuntu 24.04 cloud server image then make the below change
# The image source of the VM
cloud_image = "https://cloud-images.ubuntu.com/oracular/current/focal-server-cloudimg-amd64.img"
# cloud_image = "../oracular-server-cloudimg-amd64.img"
Otherwise, you need to download it yourself into the parent directory, which speeds things up:
cd ../
IMAGE="oracular" # 24.04
curl -sLO https://cloud-images.ubuntu.com/${IMAGE}/current/${IMAGE}-server-cloudimg-amd64.img
cd -
ls -l ../oracular-server-cloudimg-amd64.img
Spawn the VMs
We are ready to spawn our 3 VMs by running terraform plan
& terraform apply
./start.sh
output should be something like:
...
Apply complete! Resources: 16 added, 0 changed, 0 destroyed.
Outputs:
VMs = [
"192.168.122.223 k8scpnode1",
"192.168.122.50 k8swrknode1",
"192.168.122.10 k8swrknode2",
]
Verify that you have ssh access to the VMs
eg.
ssh ubuntu@192.168.122.223
Replace the IP with the one provided in the output.
DISCLAIMER: if something failed, destroy everything with ./destroy.sh
to remove any leftovers before running ./start.sh
again!
Control-Plane Node
Let’s now begin configuring the Kubernetes control-plane node.
Ports on the control-plane node
Kubernetes runs a few services that need to be accessible from the worker nodes.
Protocol | Direction | Port Range | Purpose | Used By |
---|---|---|---|---|
TCP | Inbound | 6443 | Kubernetes API server | All |
TCP | Inbound | 2379-2380 | etcd server client API | kube-apiserver, etcd |
TCP | Inbound | 10250 | Kubelet API | Self, Control plane |
TCP | Inbound | 10259 | kube-scheduler | Self |
TCP | Inbound | 10257 | kube-controller-manager | Self |
Although the etcd ports are included in the control-plane section, you can also host your own etcd cluster externally or on custom ports.
Firewall on the control-plane node
We need to open the necessary ports on the CP’s (control-plane node) firewall.
sudo ufw allow 6443/tcp
sudo ufw allow 2379:2380/tcp
sudo ufw allow 10250/tcp
sudo ufw allow 10259/tcp
sudo ufw allow 10257/tcp
# sudo ufw disable
sudo ufw status
the output should be
To Action From
-- ------ ----
22/tcp ALLOW Anywhere
6443/tcp ALLOW Anywhere
2379:2380/tcp ALLOW Anywhere
10250/tcp ALLOW Anywhere
10259/tcp ALLOW Anywhere
10257/tcp ALLOW Anywhere
22/tcp (v6) ALLOW Anywhere (v6)
6443/tcp (v6) ALLOW Anywhere (v6)
2379:2380/tcp (v6) ALLOW Anywhere (v6)
10250/tcp (v6) ALLOW Anywhere (v6)
10259/tcp (v6) ALLOW Anywhere (v6)
10257/tcp (v6) ALLOW Anywhere (v6)
Hosts file in the control-plane node
We need to update the /etc/hosts
with the internal IP and hostname.
This will help when it is time to join the worker nodes.
echo $(hostname -I) $(hostname) | sudo tee -a /etc/hosts
Just a reminder: we need to update the hosts file on all the VMs to include all the VMs’ IPs and hostnames.
If you already know them, then your /etc/hosts
file should look like this:
192.168.122.223 k8scpnode1
192.168.122.50 k8swrknode1
192.168.122.10 k8swrknode2
Replace the IPs with yours.
Updating your hosts file
If you already know the IPs of your VMs, run the below script on ALL 3 VMs:
sudo tee -a /etc/hosts <<EOF
192.168.122.223 k8scpnode1
192.168.122.50 k8swrknode1
192.168.122.10 k8swrknode2
EOF
No Swap on the control-plane node
Be sure that SWAP is disabled in all virtual machines!
sudo swapoff -a
and the fstab file should not have any swap entry.
The below command should return nothing.
sudo grep -i swap /etc/fstab
If not, edit the /etc/fstab
and remove the swap entry.
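If you prefer a one-liner over manual editing (my addition, assuming GNU sed as shipped with Ubuntu), something like this comments out any swap entries:
sudo sed -i '/\sswap\s/ s/^/#/' /etc/fstab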
If you follow my terraform k8s code example from the above github repo,
you will notice that there isn’t any swap entry in the cloud init (user-data) file.
Nevertheless, it is always a good thing to double-check.
Kernel modules on the control-plane node
We need to load the below kernel modules on all k8s nodes, so k8s can create some network magic!
- overlay
- br_netfilter
Run the below bash snippet that will do that, and also will enable the forwarding features of the network.
sudo tee /etc/modules-load.d/kubernetes.conf <<EOF
overlay
br_netfilter
EOF
sudo modprobe overlay
sudo modprobe br_netfilter
sudo lsmod | grep netfilter
sudo tee /etc/sysctl.d/kubernetes.conf <<EOF
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
EOF
sudo sysctl --system
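As an optional sanity check (my addition, not part of the original steps), confirm that the forwarding and bridge settings are now active:
sysctl net.ipv4.ip_forward net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables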
NeedRestart on the control-plane node
Before installing any software, we need to make a tiny change to the needrestart program. This helps with automating package installation and stops needrestart from asking (via a dialog) whether we would like to restart the services!
temporarily
export -p NEEDRESTART_MODE="a"
permanently
A more permanent way is to update the configuration file:
echo "$nrconf{restart} = 'a';" | sudo tee -a /etc/needrestart/needrestart.conf
Installing a Container Runtime on the control-plane node
It is time to choose which container runtime we are going to use on our k8s cluster. There are a few container runtimes for k8s; in the past docker was commonly used, but nowadays the most common runtime is containerd, which can also use the cgroup v2 kernel features. There is also a docker-engine runtime via CRI. Read here for more details on the subject.
curl -sL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/trusted.gpg.d/docker-keyring.gpg
sudo apt-add-repository -y "deb https://download.docker.com/linux/ubuntu oracular stable"
sleep 3
sudo apt-get -y install containerd.io
containerd config default \
 | sed 's/SystemdCgroup = false/SystemdCgroup = true/' \
 | sudo tee /etc/containerd/config.toml
sudo systemctl restart containerd.service
You can find the containerd configuration file here:
/etc/containerd/config.toml
In earlier versions of Ubuntu we had to enable the systemd cgroup driver ourselves.
The recommendation from the official documentation is:
When using cgroup v2, use the systemd cgroup driver instead of cgroupfs.
Starting with v1.22, when creating a cluster with kubeadm, if the user does not set the cgroupDriver field under KubeletConfiguration, kubeadm defaults it to systemd.
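A quick optional check (my addition) to confirm that the generated containerd configuration really enables the systemd cgroup driver:
grep SystemdCgroup /etc/containerd/config.toml
# should print: SystemdCgroup = true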
Installing kubeadm, kubelet and kubectl on the control-plane node
Install the kubernetes packages (kubeadm, kubelet and kubectl) by first adding the k8s repository to our virtual machine. To speed up the next step, we will also download the configuration container images.
This guide is using kubeadm, so we need to check the latest version.
Kubernetes v1.31 was the latest version when this guide was written.
VERSION="1.31"
curl -fsSL https://pkgs.k8s.io/core:/stable:/v${VERSION}/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
# allow unprivileged APT programs to read this keyring
sudo chmod 0644 /etc/apt/keyrings/kubernetes-apt-keyring.gpg
# This overwrites any existing configuration in /etc/apt/sources.list.d/kubernetes.list
echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v${VERSION}/deb/ /" | sudo tee /etc/apt/sources.list.d/kubernetes.list
# helps tools such as command-not-found to work correctly
sudo chmod 0644 /etc/apt/sources.list.d/kubernetes.list
sleep 2
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
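Optionally (my addition, following the upstream kubeadm installation advice), put the packages on hold so that an unattended upgrade cannot change the cluster version behind your back:
sudo apt-mark hold kubelet kubeadm kubectl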
Get kubernetes admin configuration images
Retrieve the Kubernetes admin configuration images.
sudo kubeadm config images pull
Initializing the control-plane node
We can now proceed with initializing the control-plane node for our Kubernetes cluster.
There are a few things we need to be careful about:
- We can specify the control-plane-endpoint if we are planning to have a highly available k8s cluster (we will skip this for now),
- Choose a Pod network add-on (next section), but be aware that CoreDNS (DNS and Service Discovery) will not run until it is installed (later),
- define where our container runtime socket is (we will skip it),
- advertise the API server address (we will skip it).
But we will set our Pod Network CIDR to the default value of the Pod network add-on so everything will go smoothly later on.
sudo kubeadm init --pod-network-cidr=10.244.0.0/16
Keep the output in a notepad.
Create user access config to the k8s control-plane node
Our k8s control-plane node is running, so we need to have credentials to access it.
kubectl reads a configuration file (that has the token), so we copy it from the k8s admin config.
rm -rf $HOME/.kube
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
ls -la $HOME/.kube/config
echo 'alias k="kubectl"' | sudo tee -a /etc/bash.bashrc
source /etc/bash.bashrc
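If you like shell completion (optional, my addition, assuming the bash-completion package is installed), kubectl can generate it, and it can be wired up to the k alias as well:
kubectl completion bash | sudo tee /etc/bash_completion.d/kubectl > /dev/null
echo 'complete -o default -F __start_kubectl k' | sudo tee -a /etc/bash.bashrc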
Verify the control-plane node
Verify that kubernetes is running.
That means we have a k8s cluster - but only the control-plane node is running.
kubectl cluster-info
# kubectl cluster-info dump
kubectl get nodes -o wide
kubectl get pods -A -o wide
Install an overlay network provider on the control-plane node
As I mentioned above, in order to use the DNS and Service Discovery services in kubernetes (CoreDNS), we need to install a Container Network Interface (CNI) based Pod network add-on so that our Pods can communicate with each other.
Kubernetes Flannel is a popular network overlay solution for Kubernetes clusters, primarily used to enable networking between pods across different nodes. It’s a simple and easy-to-implement network fabric that uses the VXLAN protocol to create a flat virtual network, allowing Kubernetes pods to communicate with each other across different hosts.
Make sure to open the below udp ports for flannel’s VXLAN traffic (if you are going to use it):
sudo ufw allow 8472/udp
To install Flannel as the networking solution for your Kubernetes (K8s) cluster, run the following command to deploy Flannel:
k apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml
Verify CoreDNS is running on the control-plane node
Verify that the control-plane node is up and running and that the control-plane pods (such as the coredns pods) are also running:
k get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k8scpnode1 Ready control-plane 12m v1.31.3 192.168.122.223 <none> Ubuntu 24.10 6.11.0-9-generic containerd://1.7.23
k get pods -A -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-flannel kube-flannel-ds-9v8fq 1/1 Running 0 2m17s 192.168.122.223 k8scpnode1 <none> <none>
kube-system coredns-7c65d6cfc9-dg6nq 1/1 Running 0 12m 10.244.0.2 k8scpnode1 <none> <none>
kube-system coredns-7c65d6cfc9-r4ksc 1/1 Running 0 12m 10.244.0.3 k8scpnode1 <none> <none>
kube-system etcd-k8scpnode1 1/1 Running 0 13m 192.168.122.223 k8scpnode1 <none> <none>
kube-system kube-apiserver-k8scpnode1 1/1 Running 0 12m 192.168.122.223 k8scpnode1 <none> <none>
kube-system kube-controller-manager-k8scpnode1 1/1 Running 0 12m 192.168.122.223 k8scpnode1 <none> <none>
kube-system kube-proxy-sxtk9 1/1 Running 0 12m 192.168.122.223 k8scpnode1 <none> <none>
kube-system kube-scheduler-k8scpnode1 1/1 Running 0 13m 192.168.122.223 k8scpnode1 <none> <none>
That’s it with the control-plane node !
Worker Nodes
The following instructions apply similarly to both worker nodes. I will document the steps for the k8swrknode1 node, but please follow the same process for the k8swrknode2 node.
Ports on the worker nodes
As we learned above in the control-plane section, kubernetes runs a few services:
Protocol | Direction | Port Range | Purpose | Used By |
---|---|---|---|---|
TCP | Inbound | 10250 | Kubelet API | Self, Control plane |
TCP | Inbound | 10256 | kube-proxy | Self, Load balancers |
TCP | Inbound | 30000-32767 | NodePort Services | All |
Firewall on the worker nodes
So we need to open the necessary ports on the worker nodes too.
sudo ufw allow 10250/tcp
sudo ufw allow 10256/tcp
sudo ufw allow 30000:32767/tcp
sudo ufw status
The output should appear as follows:
To Action From
-- ------ ----
22/tcp ALLOW Anywhere
10250/tcp ALLOW Anywhere
30000:32767/tcp ALLOW Anywhere
22/tcp (v6) ALLOW Anywhere (v6)
10250/tcp (v6) ALLOW Anywhere (v6)
30000:32767/tcp (v6) ALLOW Anywhere (v6)
and do not forget, we also need to open UDP 8472 for flannel
sudo ufw allow 8472/udp
The next few steps are pretty much exactly the same as in the control-plane node.
In order to keep this documentation short, I’ll just copy/paste the commands.
Hosts file in the worker node
Update the /etc/hosts
file to include the IPs and hostname of all VMs.
192.168.122.223 k8scpnode1
192.168.122.50 k8swrknode1
192.168.122.10 k8swrknode2
No Swap on the worker node
sudo swapoff -a
Kernel modules on the worker node
sudo tee /etc/modules-load.d/kubernetes.conf <<EOF
overlay
br_netfilter
EOF
sudo modprobe overlay
sudo modprobe br_netfilter
sudo lsmod | grep netfilter
sudo tee /etc/sysctl.d/kubernetes.conf <<EOF
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
EOF
sudo sysctl --system
NeedRestart on the worker node
export -p NEEDRESTART_MODE="a"
Installing a Container Runtime on the worker node
curl -sL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/trusted.gpg.d/docker-keyring.gpg
sudo apt-add-repository -y "deb https://download.docker.com/linux/ubuntu oracular stable"
sleep 3
sudo apt-get -y install containerd.io
containerd config default \
 | sed 's/SystemdCgroup = false/SystemdCgroup = true/' \
 | sudo tee /etc/containerd/config.toml
sudo systemctl restart containerd.service
Installing kubeadm, kubelet and kubectl on the worker node
VERSION="1.31"
curl -fsSL https://pkgs.k8s.io/core:/stable:/v${VERSION}/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
# allow unprivileged APT programs to read this keyring
sudo chmod 0644 /etc/apt/keyrings/kubernetes-apt-keyring.gpg
# This overwrites any existing configuration in /etc/apt/sources.list.d/kubernetes.list
echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v${VERSION}/deb/ /" | sudo tee /etc/apt/sources.list.d/kubernetes.list
# helps tools such as command-not-found to work correctly
sudo chmod 0644 /etc/apt/sources.list.d/kubernetes.list
sleep 3
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
Get Token from the control-plane node
To join nodes to the kubernetes cluster, we need to have a couple of things.
- a token from control-plane node
- the CA certificate hash from the control-plane node.
If you didn’t keep the output of the control-plane node’s initialization, that’s okay.
Run the below command in the control-plane node.
sudo kubeadm token list
and we will get the initial token, which expires after 24 hours.
TOKEN TTL EXPIRES USAGES DESCRIPTION EXTRA GROUPS
7n4iwm.8xqwfcu4i1co8nof 23h 2024-11-26T12:14:55Z authentication,signing The default bootstrap token generated by 'kubeadm init'. system:bootstrappers:kubeadm:default-node-token
In this case the token is 7n4iwm.8xqwfcu4i1co8nof
Get Certificate Hash from the control-plane node
To get the CA certificate hash from the control-plane-node, we need to run a complicated command:
openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'
and in my k8s cluster it is:
2f68e4b27cae2d2a6431f3da308a691d00d9ef3baa4677249e43b3100d783061
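As a shortcut (my addition, not part of the original workflow), kubeadm can also print a ready-made join command that combines a fresh token with the CA certificate hash:
sudo kubeadm token create --print-join-command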
Join Workers to the kubernetes cluster
So now we can join our worker nodes to the kubernetes cluster.
Run the below command on both worker nodes:
sudo kubeadm join 192.168.122.223:6443 \
 --token 7n4iwm.8xqwfcu4i1co8nof \
 --discovery-token-ca-cert-hash sha256:2f68e4b27cae2d2a6431f3da308a691d00d9ef3baa4677249e43b3100d783061
we get this message
Run ‘kubectl get nodes’ on the control-plane to see this node join the cluster.
Is the kubernetes cluster running ?
We can verify that with:
kubectl get nodes -o wide
kubectl get pods -A -o wide
All nodes should have successfully joined the Kubernetes cluster, so make sure they are in Ready status:
k8scpnode1 Ready control-plane 58m v1.31.3 192.168.122.223 <none> Ubuntu 24.10 6.11.0-9-generic containerd://1.7.23
k8swrknode1 Ready <none> 3m37s v1.31.3 192.168.122.50 <none> Ubuntu 24.10 6.11.0-9-generic containerd://1.7.23
k8swrknode2 Ready <none> 3m37s v1.31.3 192.168.122.10 <none> Ubuntu 24.10 6.11.0-9-generic containerd://1.7.23
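The ROLES column shows <none> for the workers; if that bothers you, you can label them (my addition, purely cosmetic):
kubectl label node k8swrknode1 node-role.kubernetes.io/worker=
kubectl label node k8swrknode2 node-role.kubernetes.io/worker=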
Also make sure all pods are in Running status:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-flannel kube-flannel-ds-9v8fq 1/1 Running 0 46m 192.168.122.223 k8scpnode1 <none> <none>
kube-flannel kube-flannel-ds-hmtmv 1/1 Running 0 3m32s 192.168.122.50 k8swrknode1 <none> <none>
kube-flannel kube-flannel-ds-rwkrm 1/1 Running 0 3m33s 192.168.122.10 k8swrknode2 <none> <none>
kube-system coredns-7c65d6cfc9-dg6nq 1/1 Running 0 57m 10.244.0.2 k8scpnode1 <none> <none>
kube-system coredns-7c65d6cfc9-r4ksc 1/1 Running 0 57m 10.244.0.3 k8scpnode1 <none> <none>
kube-system etcd-k8scpnode1 1/1 Running 0 57m 192.168.122.223 k8scpnode1 <none> <none>
kube-system kube-apiserver-k8scpnode1 1/1 Running 0 57m 192.168.122.223 k8scpnode1 <none> <none>
kube-system kube-controller-manager-k8scpnode1 1/1 Running 0 57m 192.168.122.223 k8scpnode1 <none> <none>
kube-system kube-proxy-49f6q 1/1 Running 0 3m32s 192.168.122.50 k8swrknode1 <none> <none>
kube-system kube-proxy-6qpph 1/1 Running 0 3m33s 192.168.122.10 k8swrknode2 <none> <none>
kube-system kube-proxy-sxtk9 1/1 Running 0 57m 192.168.122.223 k8scpnode1 <none> <none>
kube-system kube-scheduler-k8scpnode1 1/1 Running 0 57m 192.168.122.223 k8scpnode1 <none> <none>
That’s it !
Our k8s cluster is running.
Kubernetes Dashboard
The Kubernetes Dashboard is a general-purpose, web-based UI for Kubernetes clusters. It allows users to manage applications running in the cluster and troubleshoot them, as well as manage the cluster itself.
Next, we can move forward with installing the Kubernetes dashboard on our cluster.
Helm
Helm is a package manager for Kubernetes that simplifies the process of deploying applications to a Kubernetes cluster. As of version 7.0.0, kubernetes-dashboard has dropped support for Manifest-based installation; only Helm-based installation is supported now.
Live on the edge !
curl -sL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
Install kubernetes dashboard
We need to add the kubernetes-dashboard helm repository first and install the helm chart after:
# Add kubernetes-dashboard repository
helm repo add kubernetes-dashboard https://kubernetes.github.io/dashboard/
# Deploy a Helm Release named "kubernetes-dashboard" using the kubernetes-dashboard chart
helm upgrade --install kubernetes-dashboard kubernetes-dashboard/kubernetes-dashboard --create-namespace --namespace kubernetes-dashboard
The output of the command above should resemble something like this:
Release "kubernetes-dashboard" does not exist. Installing it now.
NAME: kubernetes-dashboard
LAST DEPLOYED: Mon Nov 25 15:36:51 2024
NAMESPACE: kubernetes-dashboard
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
*************************************************************************************************
*** PLEASE BE PATIENT: Kubernetes Dashboard may need a few minutes to get up and become ready ***
*************************************************************************************************
Congratulations! You have just installed Kubernetes Dashboard in your cluster.
To access Dashboard run:
kubectl -n kubernetes-dashboard port-forward svc/kubernetes-dashboard-kong-proxy 8443:443
NOTE: In case port-forward command does not work, make sure that kong service name is correct.
Check the services in Kubernetes Dashboard namespace using:
kubectl -n kubernetes-dashboard get svc
Dashboard will be available at:
https://localhost:8443
Verify the installation
kubectl -n kubernetes-dashboard get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes-dashboard-api ClusterIP 10.106.254.153 <none> 8000/TCP 3m48s
kubernetes-dashboard-auth ClusterIP 10.103.156.167 <none> 8000/TCP 3m48s
kubernetes-dashboard-kong-proxy ClusterIP 10.105.230.13 <none> 443/TCP 3m48s
kubernetes-dashboard-metrics-scraper ClusterIP 10.109.7.234 <none> 8000/TCP 3m48s
kubernetes-dashboard-web ClusterIP 10.106.125.65 <none> 8000/TCP 3m48s
kubectl get all -n kubernetes-dashboard
NAME READY STATUS RESTARTS AGE
pod/kubernetes-dashboard-api-6dbb79747-rbtlc 1/1 Running 0 4m5s
pod/kubernetes-dashboard-auth-55d7cc5fbd-xccft 1/1 Running 0 4m5s
pod/kubernetes-dashboard-kong-57d45c4f69-t9lw2 1/1 Running 0 4m5s
pod/kubernetes-dashboard-metrics-scraper-df869c886-lt624 1/1 Running 0 4m5s
pod/kubernetes-dashboard-web-6ccf8d967-9rp8n 1/1 Running 0 4m5s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kubernetes-dashboard-api ClusterIP 10.106.254.153 <none> 8000/TCP 4m10s
service/kubernetes-dashboard-auth ClusterIP 10.103.156.167 <none> 8000/TCP 4m10s
service/kubernetes-dashboard-kong-proxy ClusterIP 10.105.230.13 <none> 443/TCP 4m10s
service/kubernetes-dashboard-metrics-scraper ClusterIP 10.109.7.234 <none> 8000/TCP 4m10s
service/kubernetes-dashboard-web ClusterIP 10.106.125.65 <none> 8000/TCP 4m10s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/kubernetes-dashboard-api 1/1 1 1 4m7s
deployment.apps/kubernetes-dashboard-auth 1/1 1 1 4m7s
deployment.apps/kubernetes-dashboard-kong 1/1 1 1 4m7s
deployment.apps/kubernetes-dashboard-metrics-scraper 1/1 1 1 4m7s
deployment.apps/kubernetes-dashboard-web 1/1 1 1 4m7s
NAME DESIRED CURRENT READY AGE
replicaset.apps/kubernetes-dashboard-api-6dbb79747 1 1 1 4m6s
replicaset.apps/kubernetes-dashboard-auth-55d7cc5fbd 1 1 1 4m6s
replicaset.apps/kubernetes-dashboard-kong-57d45c4f69 1 1 1 4m6s
replicaset.apps/kubernetes-dashboard-metrics-scraper-df869c886 1 1 1 4m6s
replicaset.apps/kubernetes-dashboard-web-6ccf8d967 1 1 1 4m6s
Accessing Dashboard via a NodePort
A NodePort is a type of Service in Kubernetes that exposes a service on each node’s IP at a static port. This allows external traffic to reach the service by accessing the node’s IP and port. kubernetes-dashboard by default runs on an internal 10.x.x.x IP. To access the dashboard, we need to add a NodePort to the kubernetes-dashboard service.
We can either Patch the service or edit the yaml file.
Choose one of the two options below; there’s no need to run both as it’s unnecessary (but not harmful).
Patch kubernetes-dashboard
This is one way to add a NodePort.
kubectl --namespace kubernetes-dashboard patch svc kubernetes-dashboard-kong-proxy -p '{"spec": {"type": "NodePort"}}'
output
service/kubernetes-dashboard-kong-proxy patched
verify the service
kubectl get svc -n kubernetes-dashboard
output
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes-dashboard-api ClusterIP 10.106.254.153 <none> 8000/TCP 50m
kubernetes-dashboard-auth ClusterIP 10.103.156.167 <none> 8000/TCP 50m
kubernetes-dashboard-kong-proxy NodePort 10.105.230.13 <none> 443:32116/TCP 50m
kubernetes-dashboard-metrics-scraper ClusterIP 10.109.7.234 <none> 8000/TCP 50m
kubernetes-dashboard-web ClusterIP 10.106.125.65 <none> 8000/TCP 50m
We can see NodePort 32116 on the kubernetes-dashboard-kong-proxy service.
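If you want to grab the assigned NodePort programmatically (optional, my addition), jsonpath can extract it:
kubectl -n kubernetes-dashboard get svc kubernetes-dashboard-kong-proxy -o jsonpath='{.spec.ports[0].nodePort}'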
Edit kubernetes-dashboard Service
This is an alternative way to add a NodePort.
kubectl edit svc -n kubernetes-dashboard kubernetes-dashboard-kong-proxy
and change the service type from
type: ClusterIP
to
type: NodePort
Accessing Kubernetes Dashboard
The kubernetes-dashboard release consists of several pods (api, auth, kong proxy, metrics scraper and web).
To access the dashboard, we first need to identify on which node it is running.
kubectl get pods -n kubernetes-dashboard -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kubernetes-dashboard-api-56f6f4b478-p4xbj 1/1 Running 0 55m 10.244.2.12 k8swrknode1 <none> <none>
kubernetes-dashboard-auth-565b88d5f9-fscj9 1/1 Running 0 55m 10.244.1.12 k8swrknode2 <none> <none>
kubernetes-dashboard-kong-57d45c4f69-rts57 1/1 Running 0 55m 10.244.2.10 k8swrknode1 <none> <none>
kubernetes-dashboard-metrics-scraper-df869c886-bljqr 1/1 Running 0 55m 10.244.2.11 k8swrknode1 <none> <none>
kubernetes-dashboard-web-6ccf8d967-t6k28 1/1 Running 0 55m 10.244.1.11 k8swrknode2 <none> <none>
In my setup the dashboard (kong proxy) pod is running on worker node 1, which according to /etc/hosts
has the IP 192.168.122.50.
The NodePort is 32116
k get svc -n kubernetes-dashboard -o wide
So, we can open a new tab on our browser and type:
https://192.168.122.50:32116
and accept the self-signed certificate!
Create An Authentication Token (RBAC)
Last step for the kubernetes-dashboard is to create an authentication token.
Creating a Service Account
Create a new yaml file with kind: ServiceAccount, in the kubernetes-dashboard namespace, with name: admin-user.
cat > kubernetes-dashboard.ServiceAccount.yaml <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: admin-user
  namespace: kubernetes-dashboard
EOF
add this service account to the k8s cluster
kubectl apply -f kubernetes-dashboard.ServiceAccount.yaml
output
serviceaccount/admin-user created
Creating a ClusterRoleBinding
We need to bind the Service Account with the kubernetes-dashboard via Role-based access control.
cat > kubernetes-dashboard.ClusterRoleBinding.yaml <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: admin-user
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: admin-user
  namespace: kubernetes-dashboard
EOF
apply this yaml file
kubectl apply -f kubernetes-dashboard.ClusterRoleBinding.yaml
clusterrolebinding.rbac.authorization.k8s.io/admin-user created
That means, our Service Account User has all the necessary roles to access the kubernetes-dashboard.
Getting a Bearer Token
Final step is to create/get a token for our user.
kubectl -n kubernetes-dashboard create token admin-user
eyJhbGciOiJSUzI1NiIsImtpZCI6IlpLbDVPVFQxZ1pTZlFKQlFJQkR6dVdGdGpvbER1YmVmVmlJTUd5WEVfdUEifQ.eyJhdWQiOlsiaHR0cHM6Ly9rdWJlcm5ldGVzLmRlZmF1bHQuc3ZjLmNsdXN0ZXIubG9jYWwiXSwiZXhwIjoxNzMyNzI0NTQ5LCJpYXQiOjE3MzI3MjA5NDksImlzcyI6Imh0dHBzOi8va3ViZXJuZXRlcy5kZWZhdWx0LnN2Yy5jbHVzdGVyLmxvY2FsIiwianRpIjoiMTczNzQyZGUtNDViZi00NjhkLTlhYWYtMDg3MDA3YmZmMjk3Iiwia3ViZXJuZXRlcy5pbyI6eyJuYW1lc3BhY2UiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsInNlcnZpY2VhY2NvdW50Ijp7Im5hbWUiOiJhZG1pbi11c2VyIiwidWlkIjoiYWZhZmNhYzItZDYxNy00M2I0LTg2N2MtOTVkMzk5YmQ4ZjIzIn19LCJuYmYiOjE3MzI3MjA5NDksInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDprdWJlcm5ldGVzLWRhc2hib2FyZDphZG1pbi11c2VyIn0.AlPSIrRsCW2vPa1P3aDQ21jaeIU2MAtiKcDO23zNRcd8-GbJUX_3oSInmSx9o2029eI5QxciwjduIRdJfTuhiPPypb3tp31bPT6Pk6_BgDuN7n4Ki9Y2vQypoXJcJNikjZpSUzQ9TOm88e612qfidSc88ATpfpS518IuXCswPg4WPjkI1WSPn-lpL6etrRNVfkT1eeSR0fO3SW3HIWQX9ce-64T0iwGIFjs0BmhDbBtEW7vH5h_hHYv3cbj_6yGj85Vnpjfcs9a9nXxgPrn_up7iA6lPtLMvQJ2_xvymc57aRweqsGSHjP2NWya9EF-KBy6bEOPB29LaIaKMywSuOQ
Add this token to the previous login page
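The token created this way is short-lived; for a lab setup you can request a longer validity (optional, my addition):
kubectl -n kubernetes-dashboard create token admin-user --duration=24h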
Browsing Kubernetes Dashboard
eg. Cluster –> Nodes
Nginx App
Before finishing this blog post, I would also like to share how to install a simple nginx-app, as it is customary to do such a thing in every new k8s cluster.
But please excuse me, I will not go into much detail.
You should be able to understand the below k8s commands.
Install nginx-app
kubectl create deployment nginx-app --image=nginx --replicas=2
deployment.apps/nginx-app created
Get Deployment
kubectl get deployment nginx-app -o wide
NAME READY UP-TO-DATE AVAILABLE AGE CONTAINERS IMAGES SELECTOR
nginx-app 2/2 2 2 64s nginx nginx app=nginx-app
Expose Nginx-App
kubectl expose deployment nginx-app --type=NodePort --port=80
service/nginx-app exposed
Verify Service nginx-app
kubectl get svc nginx-app -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
nginx-app NodePort 10.98.170.185 <none> 80:31761/TCP 27s app=nginx-app
Describe Service nginx-app
kubectl describe svc nginx-app
Name: nginx-app
Namespace: default
Labels: app=nginx-app
Annotations: <none>
Selector: app=nginx-app
Type: NodePort
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.98.170.185
IPs: 10.98.170.185
Port: <unset> 80/TCP
TargetPort: 80/TCP
NodePort: <unset> 31761/TCP
Endpoints: 10.244.1.10:80,10.244.2.10:80
Session Affinity: None
External Traffic Policy: Cluster
Events: <none>
Curl Nginx-App
curl http://192.168.122.8:31761
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>
<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>
<p><em>Thank you for using nginx.</em></p>
</body>
</html>
Nginx-App from Browser
Change the default page
Last but not least, let’s modify the default index page to something different, for educational purposes, with the help of a ConfigMap.
The idea is to create a ConfigMap with the HTML of our new index page and then attach it to our nginx deployment as a volume mount!
cat > nginx_config.map << EOF
apiVersion: v1
data:
  index.html: |
    <!DOCTYPE html>
    <html lang="en">
    <head>
    <title>A simple HTML document</title>
    </head>
    <body>
    <p>Change the default nginx page </p>
    </body>
    </html>
kind: ConfigMap
metadata:
  name: nginx-config-page
  namespace: default
EOF
cat nginx_config.map
apiVersion: v1
data:
  index.html: |
    <!DOCTYPE html>
    <html lang="en">
    <head>
    <title>A simple HTML document</title>
    </head>
    <body>
    <p>Change the default nginx page </p>
    </body>
    </html>
kind: ConfigMap
metadata:
  name: nginx-config-page
  namespace: default
apply the config.map
kubectl apply -f nginx_config.map
verify
kubectl get configmap
NAME DATA AGE
kube-root-ca.crt 1 2d3h
nginx-config-page 1 16m
Now the difficult part: we need to mount our config map into the nginx deployment, and to do that we need to edit the nginx deployment.
kubectl edit deployments.apps nginx-app
rewrite spec section to include:
- the VolumeMount &
- the ConfigMap as Volume
spec:
  containers:
  - image: nginx
    ...
    volumeMounts:
    - mountPath: /usr/share/nginx/html
      name: nginx-config
  ...
  volumes:
  - configMap:
      name: nginx-config-page
    name: nginx-config
After saving, the nginx deployment will be updated by itself.
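To double-check that the new pods really serve the file from the ConfigMap (an optional check I added), wait for the rollout and read the file from inside the deployment:
kubectl rollout status deployment nginx-app
kubectl exec deploy/nginx-app -- cat /usr/share/nginx/html/index.html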
Finally, we can see our updated index page:
That’s it
I hope you enjoyed this post.
-Evaggelos Balaskas
destroy our lab
./destroy.sh
...
libvirt_domain.domain-ubuntu["k8wrknode1"]: Destroying... [id=446cae2a-ce14-488f-b8e9-f44839091bce]
libvirt_domain.domain-ubuntu["k8scpnode"]: Destroying... [id=51e12abb-b14b-4ab8-b098-c1ce0b4073e3]
time_sleep.wait_for_cloud_init: Destroying... [id=2022-08-30T18:02:06Z]
libvirt_domain.domain-ubuntu["k8wrknode2"]: Destroying... [id=0767fb62-4600-4bc8-a94a-8e10c222b92e]
time_sleep.wait_for_cloud_init: Destruction complete after 0s
libvirt_domain.domain-ubuntu["k8wrknode1"]: Destruction complete after 1s
libvirt_domain.domain-ubuntu["k8scpnode"]: Destruction complete after 1s
libvirt_domain.domain-ubuntu["k8wrknode2"]: Destruction complete after 1s
libvirt_cloudinit_disk.cloud-init["k8wrknode1"]: Destroying... [id=/var/lib/libvirt/images/Jpw2Sg_cloud-init.iso;b8ddfa73-a770-46de-ad16-b0a5a08c8550]
libvirt_cloudinit_disk.cloud-init["k8wrknode2"]: Destroying... [id=/var/lib/libvirt/images/VdUklQ_cloud-init.iso;5511ed7f-a864-4d3f-985a-c4ac07eac233]
libvirt_volume.ubuntu-base["k8scpnode"]: Destroying... [id=/var/lib/libvirt/images/l5Rr1w_ubuntu-base]
libvirt_volume.ubuntu-base["k8wrknode2"]: Destroying... [id=/var/lib/libvirt/images/VdUklQ_ubuntu-base]
libvirt_cloudinit_disk.cloud-init["k8scpnode"]: Destroying... [id=/var/lib/libvirt/images/l5Rr1w_cloud-init.iso;11ef6bb7-a688-4c15-ae33-10690500705f]
libvirt_volume.ubuntu-base["k8wrknode1"]: Destroying... [id=/var/lib/libvirt/images/Jpw2Sg_ubuntu-base]
libvirt_cloudinit_disk.cloud-init["k8wrknode1"]: Destruction complete after 1s
libvirt_volume.ubuntu-base["k8wrknode2"]: Destruction complete after 1s
libvirt_cloudinit_disk.cloud-init["k8scpnode"]: Destruction complete after 1s
libvirt_cloudinit_disk.cloud-init["k8wrknode2"]: Destruction complete after 1s
libvirt_volume.ubuntu-base["k8wrknode1"]: Destruction complete after 1s
libvirt_volume.ubuntu-base["k8scpnode"]: Destruction complete after 2s
libvirt_volume.ubuntu-vol["k8wrknode1"]: Destroying... [id=/var/lib/libvirt/images/Jpw2Sg_ubuntu-vol]
libvirt_volume.ubuntu-vol["k8scpnode"]: Destroying... [id=/var/lib/libvirt/images/l5Rr1w_ubuntu-vol]
libvirt_volume.ubuntu-vol["k8wrknode2"]: Destroying... [id=/var/lib/libvirt/images/VdUklQ_ubuntu-vol]
libvirt_volume.ubuntu-vol["k8scpnode"]: Destruction complete after 0s
libvirt_volume.ubuntu-vol["k8wrknode2"]: Destruction complete after 0s
libvirt_volume.ubuntu-vol["k8wrknode1"]: Destruction complete after 0s
random_id.id["k8scpnode"]: Destroying... [id=l5Rr1w]
random_id.id["k8wrknode2"]: Destroying... [id=VdUklQ]
random_id.id["k8wrknode1"]: Destroying... [id=Jpw2Sg]
random_id.id["k8wrknode2"]: Destruction complete after 0s
random_id.id["k8scpnode"]: Destruction complete after 0s
random_id.id["k8wrknode1"]: Destruction complete after 0s
Destroy complete! Resources: 16 destroyed.
Friday, 08 November 2024
KDE Gear 24.12 branches created
Make sure you commit anything you want to end up in the KDE Gear 24.12
releases to them
Next Dates:
- November 14, 2024: 24.12 freeze and beta (24.11.80) tagging and release
- November 28, 2024: 24.12 RC (24.11.90) tagging and release
- December 5, 2024: 24.12 tagging
- December 12, 2024: 24.12 release
Thursday, 07 November 2024
INWX DNS Recordmaster - Manage your DNS nameserver records via files in Git
I own and manage 30+ domains at INWX, a large and professional domain registrar. Although INWX has a somewhat decent web interface, it became a burden for me to keep an overview of each domain’s sometimes dozens of records. Especially when e.g. changing an IP address for more than one domain, it caused multiple error-prone clicks and copy/pastes that couldn’t be reverted in the worst case. This is why I created INWX DNS Recordmaster which I will shortly present here.
If you are an INWX customer, you can use this tool to manage all your DNS records in YAML files. Ideally, you will store these files in a Git repository which you can use to track changes and roll back in case of a mistake. Having one file per domain provides you a number of further advantages:
- You can easily copy/paste records from other domains, e.g. for SPF, DKIM or NS records
- Overall search/replace of certain values becomes much easier, e.g. of IP addresses
- You can prepare larger changes offline and can synchronise once you feel it’s done
INWX DNS Recordmaster takes care of making the required changes to the live records so that they match the local state. This is done via the INWX API, ensuring that the number of API calls is minimal.
This even allows you to set up a pipeline that takes care of the synchronisation1.
Wait, there is more
As written above, I already had a large stack of domains that I previously managed via the web interface. This is why some additional convenience features found their way into the tool.
- You can convert all records of an existing and already configured domain at INWX into the file format. This made onboarding my 30+ domains a matter of a few minutes.
- On a global or per-domain level, you can ignore certain record types. For example, if you don’t want to touch any NS records, you can configure that. By default, SOA records are ignored. You may even ignore all live records that don’t exist in your local configuration.
- Of course, you can make a dry run to see which effects your configuration will have in practice.
Did I miss something to make it more productive for you? Let me know!
Install, use, contribute
You are welcome to install this tool; it’s Free and Open Source Software, after all. All you need is Python installed.
One of the tool’s users is the OpenRail Association, which manages some of its domains with this program and has published its configuration. This is a prime example of how an organisation can make the management of records transparent and easy to change, at least internally, if not even externally.
While the tool is not perfect, it already is a huge gain for efficiency and stability of my IT operations, and it already proves its capabilities for other users. To reach the remaining 20% to perfection (that will take 80% of the time, as always), you are most welcome to add issues with enhancement proposals, and if possible, also pull requests.
-
For example, see the workflow file of the OpenRail Association. ↩︎
Tuesday, 05 November 2024
Music production with Linux: How to use Guitarix and Ardour together
Music production for guitar has a lot of options on Linux. We will see how to install the required software, and how to use Guitarix together with Ardour either with the standalone version of Guitarix or with an embedded version inside Ardour.
Software installation and configuration
Install Ardour, a music production software under the GPLv2 license. For Archlinux run:
sudo pacman -S ardour
For other operating systems you can follow the Ardour installation page or on flathub.
Install qpwgraph to visualize pipewire connections. This is not mandatory, but it is highly recommended to make sure Ardour, Guitarix and their respective inputs and outputs are wired correctly.
sudo pacman -S qpwgraph
Make sure your user is in the audio and realtime groups:
sudo usermod -a -G audio $USER
sudo usermod -a -G realtime $USER
and set the real-time priority and memory limits of the audio group in /etc/security/limits.d/audio.conf:
@audio - rtprio 95
@audio - memlock unlimited
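After logging out and back in, you can verify that the group membership and limits are in effect (a quick check I added, not required by the guide):
groups    # should list audio and realtime
ulimit -r # real-time priority limit, should report 95
ulimit -l # locked memory limit, should report unlimited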
Start Ardour, select “Recording Session” and select only one audio input.
Guitarix as a standalone program
We will first see how to use Guitarix as a standalone program. Guitarix is a virtual amplifier released under the GPLv2 license which uses Jack to add audio effects to a raw guitar signal from a microphone or guitar pickup.
To install Guitarix on Archlinux run:
sudo pacman -S guitarix
Other installation instructions are available on the Guitarix installation page or on flathub.
Starting Guitarix shows the main window. The left panel shows the effects available, which can be dragged onto the main panel to put them on the rack and change their settings.

Guitarix main window.
To configure Guitarix’s input and output, go to the “Engine” menu and click on “Jack Ports”. The inputs should be the guitar pickup and microphone, and the output should be Ardour “audio_in”. Make sure Ardour is started so that it can be selected in the output section.

Jack Input and Output selection in guitarix.
The Guitarix output configuration can be checked on the Ardour side as well. In
Ardour, select the “Rec” tab (with the button in the top right corner) and
choose the routing grid option using the third button of the “Audio 1” row. This
will display a routing grid where you can check whether only the output of Guitarix
gx_head_fx
is selected.

Routing of Audio 1 where the guitarix output is selected.
The jack graph of this setup will see the guitar pickup or microphone connected
to Guitarix, the Guitarix output connected to Ardour, and the Ardour output
connected to the system’s playback. The graph from qpwgraph
below illustrates this
configuration and allows checking for feedback loops and incorrect connections.

Jack graph connection of guitarix and ardour.
To record the Ardour output, press the red recording button in “Audio 1” row of the “Rec” tab. To monitor the audio that will be recorded (i.e. the Guitarix output), you can press the “In” button.

Audio 1 channel in the ‘Rec’ tab.
Guitarix supports Neural Amp Modeler (NAM) plugins to emulate any hardware amplifier, pedal or impulse responses. NAM models can be downloaded on ToneHunt and loaded under the “Neural” section in the pool tab.
Guitarix as a plugin inside Ardour
Guitarix exists as a VST3 plugin for music production software. The plugin shares its configuration with the standalone Guitarix app, so Guitarix presets and settings from the standalone app are available in the plugin.
Install the plugin on Archlinux from AUR:
paru -S guitarix.vst
or head to the project repository for builds for other operating systems.
To load Guitarix as an Ardour plugin, go to the “Mix” tab (in the top right corner), then right-click on the black area below the fader and select “New Plugin” and “Plugin Selector”. The “Guitarix” plugin can be inserted from the newly opened window. Double-clicking on Guitarix opens the plugin window, which roughly looks like the standalone program. Effects can be added using the “plus” symbol next to the input and AMP stack boxes. Community-made presets can also be downloaded using the “Online” button.

Guitarix plugin within Ardour.
If Guitarix is used within Ardour as a plugin, the Ardour input (i.e. in this example the microphone) must be selected in the Routing grid of the audio track. The jack graph of this setup looks simpler, as the microphone is directly connected to the Ardour audio track.

Jack graph of Ardour without Guitarix.
Record and export the recordings
To do recordings, go to the “Rec” tab and make sure the audio track has the red “record” button checked. Then go to the “Edit” tab, click on the global “Toggle record” button, hit “Play from playhead” and there goes the music!
To export the recordings, go to the “Session” menu, then “Export” and “Export to file”. In the Export dialog, select the right file format, time span and channels and click on export.

Export dialog in Ardour with channel selection.
Friday, 01 November 2024
Relaxed RSS with sfeed
While the golden age of huge platforms offering RSS feeds may be over, lots of webpages still have feeds, especially niche blogs. Over the years, I have tried many RSS readers, but have always given up because they did not work for me. However, by switching to the more obscure sfeed, I found a workflow that worked (and flowed) for me.
Why RSS and what for?
But before getting into it, I think I need to clarify what I use RSS for. And just to say it once, this post does not distinguish between RSS and Atom, as the difference is a technicality. To stay on that technical level just for a moment, RSS is only a file format to describe feeds. Major news outlets use it to announce multiple news items per hour, some social media sites allow each post to be read via RSS, code forges publish releases or even commits via RSS, and it is the secret ingredient in the podcasting sauce.
The bottom line is that RSS may be out of sight, but it is still everywhere. But if one is going to subscribe to everything, one might end up with hundreds or thousands of posts in no time. This would be, for me at least, an information overload, and I would simply stop reading. Besides, for daily news, I just go to one or several of the big news sites anyway.
So I mainly subscribe to smaller private blogs or niche news sites with a few articles per month. Supplemented by a careful selection of meta-aggregators and mailing list archives, both further filtered. At the moment, there must be round about 130 feeds. This number is growing slowly but steadily as I find something new more often than I decide to drop a source. Depending on the day of the week, this results in about twenty posts each day.
User Experience Matters
In my experience, most RSS readers follow an inbox architecture that expects me to interact with each new post (or to give up and mark them all as read). So even after pre-filtering, the inbox would fill up quickly. While this inbox concept works well for things which are actually important, like the one in a hundred email you need to respond to, it makes reading feeds uncomfortably stressful, like working against an ever-growing to-do list. This is an information overload and a FOMO scenario all over again.
Using sfeed and some degree of customization, I was able to build myself a private news feed. This feed is generated daily and is then completely static: there is no unread counter, and if I skip some days I can read back, but no software bullies me into it.
In particular, there are two (or three) ways how I consume my feeds. First, there is a custom web feed that shows the last 256 entries, grouped and ordered by date - called the bytefeed. This has become one of the first pages I open in the morning, usually before checking to see if anything of world importance has happened. From there, I can follow each aggregated feed to my archive, listing all known feeds, also as a webpage.

Exemplary output of sfeed-bytefeed, in dark and non-dark mode.
Furthermore, the previous day’s posts are also sent to a private IRC channel. This may seem a bit quirky, but personally I am an excessive IRC user, not only for communication, but also as a message broker. Using my favorite IRC client WeeChat, I am able to receive both monitoring and news events in WeeChat itself and on my smartphone.
00:31 -- Notice(xdf) -> #feeds: HTB: Editorial https://0xdf.gitlab.io/2024/10/19/htb-editorial.html
00:31 -- Notice(analognow) -> #feeds: unpatchable fourth wall breaking sentience https://analognowhere.com/_/ghtmnt
00:31 -- Notice(analognow) -> #feeds: illegal book fair https://analognowhere.com/_/rnituc
00:31 -- Notice(dragasitn) -> #feeds: Outdated Infrastructure and the Cloud Illusion https://it-notes.dragas.net/2024/10/19/outdated-infrastructure-and-the-cloud-illusion/
00:31 -- Notice(fsfeplane) -> #feeds: TSDgeos' blog: KDE Gear 24.12 release schedule https://tsdgeos.blogspot.com/2024/10/kde-gear-2412-release-schedule.html
00:31 -- Notice(grumpyweb) -> #feeds: nikitonsky is being grumpy https://grumpy.website/1582
[...]
As initially stated, the user experience of a tool matters a lot. While this particular UX may not be appealing to some (or most), it does a pretty decent job for me: Getting daily news digests without having to work through them or actively interact with an “app”.
sfeed?
So far I have only talked about my fear of software and my obsessions with workflows. But what exactly is sfeed?
In a nutshell, sfeed is a collection of small tools to convert RSS feeds into a TAB-separated value (TSV) file and then to present these TSVs in another human or machine-friendly way. These tools are written either in C or as portable POSIX shell scripts, so they can be used on most operating systems. Everything comes with a well-written man page and an exhaustive README.
While this may sound boring at first, sfeed’s simple architecture makes it easy to build a custom RSS reader. Representing feeds as a TSV instead of some weird XML allows writing filters or further output generators in almost any script language, like awk.
While sfeed is not restricted to be used only on servers, I configured a pipeline like the following to be run nightly via cron on a small VM:
- Call sfeed_update to refresh all RSS feeds configured.
- Create two HTML files served by httpd: the feed archive via sfeed_html and the bytefeed shown above via a custom sfeed_bytefeed script.
- Send the day’s posts to the IRC via my custom sfeed_irc script.
All this is being triggered from one small script, being configured as a cron job. Afterwards, everything is static until the next iteration starts.
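Purely for illustration (the schedule and path are my assumptions, not taken from this post), the nightly crontab entry for the _rss user, edited with crontab -u _rss -e, could look something like this:
# run the whole sfeed pipeline shortly after midnight
30 0 * * * $HOME/sfeed-contrib/sfeed_run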
How the sfeed sausage is made
To get started, the essential sfeed tools are required.
Several package managers provide a sfeed
package.
Otherwise, compiling sfeed yourself should be a walk in the park due to the minimal dependencies.
In an attempt to make it all less abstract, I will add commands usable on OpenBSD. A certain degree of cognitive flexibility is assumed.
user@openbsd:~> doas pkg_add sfeed
While most of the sfeed tools come with a pledge(2)
promise by default, sanity and reason recommend creating a custom user.
user@openbsd:~> doas useradd -g =uid -m -s /sbin/nologin _rss
After switching to this unprivileged _rss
user, building on the sfeed example configuration is a good starting point.
One can either find an example installed by your package manager or upstream.
_rss@openbsd:~> mkdir ~/.sfeed
_rss@openbsd:~> cp \
/usr/local/share/examples/sfeed/sfeedrc.example \
~/.sfeed/sfeedrc
A quick look at the .sfeed/sfeedrc
file along with skimming sfeedrc(5)
should explain the basics.
In a nutshell, the feeds
function contains multiple feed
function calls, representing the remote RSS feeds to be fetched.
How the fetching is done can be overridden by a custom fetch
function.
In particular, a custom filter
function allows manipulating the fetched data, but more on that later.
To demonstrate this further, let’s start with this very minimal .sfeed/sfeedrc
and fetch the feeds via sfeed_update(1)
.
_rss@openbsd:~> cat .sfeed/sfeedrc
feeds() {
    feed "undeadly" "https://undeadly.org/cgi?action=rss"
    feed "xkcd" "https://xkcd.com/rss.xml"
}
_rss@openbsd:~> sfeed_update
[20:25:48] xkcd OK
[20:25:48] undeadly OK
Each feed is stored in its own file, represented by the feed name, within ~/.sfeed/feeds.
So you might end up with something like this.
As mentioned above, sfeed works on TSV, and these files are already in the TAB-separated format described in sfeed(1).
For example, one can extract the title of each xkcd entry.
_rss@openbsd:~> ls -l .sfeed/feeds/
-rw-r--r-- 1 _rss _rss 13727 Oct 27 20:25 undeadly
-rw-r--r-- 1 _rss _rss 1740 Oct 27 20:25 xkcd
_rss@openbsd:~> awk -F '\t' '{ print $2 }' .sfeed/feeds/xkcd
Sandwich Helix
RNAWorld
Temperature Scales
Experimental Astrophysics
But in most cases there is no need to manually inspect a feed file.
There is an sfeed tool for that!
sfeed_plain(1) gives a nice terminal listing, while sfeed_html(1) renders HTML output.
Both work on single as well as multiple feeds.
_rss@openbsd:~> sfeed_plain .sfeed/feeds/xkcd
2024-10-25 06:00 xkcd Sandwich Helix https://xkcd.com/3003/
2024-10-23 06:00 xkcd RNAWorld https://xkcd.com/3002/
2024-10-21 06:00 xkcd Temperature Scales https://xkcd.com/3001/
2024-10-18 06:00 xkcd Experimental Astrophysics https://xkcd.com/3000/
_rss@openbsd:~> sfeed_html .sfeed/feeds/*
<!DOCTYPE HTML>
<html>
[. . .]
Custom Scripts
While sfeed comes with a multitude of tools to build an RSS reader with, it is a very hackable ecosystem, as mentioned and demonstrated above.
To build my RSS reader, the following tools have emerged over time.
Some of them were C programs derived from sfeed_plain, but to celebrate this year's awktober, they were rewritten in awk.
For this blog post, I have cleaned up my local clone of the sfeed repository and moved them to sfeed-contrib.
There are two types of scripts in this repo: custom sfeed formatters like sfeed_bytefeed, and helper scripts, especially for automation.
_rss@openbsd:~> git clone https://codeberg.org/oxzi/sfeed-contrib.git
_rss@openbsd:~> ls -l sfeed-contrib/
total 32
drwxr-xr-x 2 _rss _rss 512 Oct 26 22:35 LICENSES
-rw-r--r-- 1 _rss _rss 1995 Oct 26 22:35 README.md
-rwxr-xr-x 1 _rss _rss 1858 Oct 26 22:35 sfeed_bytefeed
-rwxr-xr-x 1 _rss _rss 307 Oct 26 22:35 sfeed_edit
-rwxr-xr-x 1 _rss _rss 1405 Oct 26 22:35 sfeed_irc
-rwxr-xr-x 1 _rss _rss 1104 Oct 26 22:35 sfeed_run
-rwxr-xr-x 1 _rss _rss 192 Oct 26 22:35 sfeed_test
-rw-r--r-- 1 _rss _rss 1169 Oct 26 22:35 style.css
Both sfeed_bytefeed and sfeed_irc have already been roughly explained above, but I am going to repeat myself.
They take sfeed feed files as parameters and, respectively, create an HTML feed of the last 256 entries or post the day's entries to an IRC channel.
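The interesting part of the IRC helper, selecting "the day's posts" from the TSV data, boils down to a few lines. The following is a rough sketch, not the actual sfeed_irc; it approximates the day as the last 24 hours and leaves the IRC delivery out:
# Sketch: print "title link" for every entry newer than 24 hours,
# roughly the payload an sfeed_irc-style script sends as NOTICEs.
since=$(( $(date +%s) - 86400 ))
awk -F '\t' -v since="$since" '$1 >= since { printf "%s %s\n", $2, $3 }' "$HOME/.sfeed/feeds"/*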
The style.css file is a customization of the upstream stylesheet to support both sfeed_html and sfeed_bytefeed.
The entire workflow of updating feeds and utilizing formatters was glued together in sfeed_run.
This script does exactly that, after some preflight checks for the configuration expected in .sfeed/sfeedrc.
In order to share this script (command listing would be the better word) with the world, some potentially sensitive values have been moved elsewhere.
Thus, sfeed_run expects sfeedpath to be set according to sfeedrc(5), sfeedwwwroot to point to a directory where the _rss user has write access to dump the HTML files and a gzipped version, and sfeedirchost and sfeedircport to point to an open IRC server.
_rss@openbsd:~> head -n 4 .sfeed/sfeedrc
sfeedpath="$HOME/.sfeed/feeds"
sfeedwwwroot="/var/www/htdocs/rss.example.internal"
sfeedirchost="irc.example.internal"
sfeedircport="6667"
Since sfeed_run executes the whole pipeline, it is run as a nightly cron job.
As some feeds may fail, the output (sfeed_run only prints errors) is sent to another user who may log in from time to time.
_rss@openbsd:~> crontab -l
MAILTO=user
PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin
30 0 * * * /home/_rss/sfeed-contrib/sfeed_run
Configuring sfeed
While a .sfeed/sfeedrc file may look as short as the example posted above, it can grow over time.
However, the most common changes are updated feed entries in the feeds function.
To verify whether a URL to some feed works, I built the little sfeed_test helper script.
It sources the .sfeed/sfeedrc file and uses the fetch function to retrieve the remote content.
For information on fetch, consult sfeedrc(5).
In the example above, no such function was defined, resulting in curl being used.
In my setup, I use OpenBSD's ftp(1) command, which is also present in the example configuration of the OpenBSD port.
_rss@openbsd:~> grep -A 2 '^fetch' .sfeed/sfeedrc
fetch() {
    ftp -M -V -w 15 -o - "$2"
}
This allows a simple test like the following, where the latest posts will be shown, formatted with sfeed_plain(1).
_rss@openbsd:~> ./sfeed-contrib/sfeed_test 'https://lobste.rs/t/openbsd.rss'
2024-10-13 11:14 OpenBSD is Hard to Show Off https://atthis.link/blog/2024/16379.html
2024-10-07 22:35 OpenBSD 7.6 https://www.openbsd.org/76.html
2024-10-04 15:20 I Solve Problems https://it-notes.dragas.net/2024/10/03/i-solve-problems-eurobsdcon/
[. . .]
Thus, after verifying that the URL to a new feed actually works, I use sfeed_edit to edit the .sfeed/sfeedrc file.
Again, this script is mostly a wrapper around opening the configuration file with vim.
This script's magic is keeping track of configuration file changes via rcs(1) - yes, the Revision Control System, that single-file CVS thingy.
If changes are detected, they have to be committed and sfeed_run is executed; otherwise, nothing happens.
The rcs(1) part is mostly there because I personally like to put my configurations into some version control system.
Doing so gives me backups, at least to some extent, and a wonderful history via rlog(1).
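For illustration, the core of such a wrapper can be sketched in a few lines. This is not the actual sfeed_edit from sfeed-contrib, just the idea, and it assumes sfeedrc is already under rcs control (e.g. via an initial ci -l):
#!/bin/sh
# Rough sketch of an sfeed_edit-style wrapper; not the real script.
conf="$HOME/.sfeed/sfeedrc"

${EDITOR:-vim} "$conf"

# rcsdiff exits non-zero when the working file differs from the last revision.
if rcsdiff -q "$conf" >/dev/null 2>&1; then
    echo 'no changes' >&2
else
    ci -l -m'update sfeedrc' "$conf"   # check in, keep the file locked and writable
    "$HOME/sfeed-contrib/sfeed_run"
fi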
Filtering and Transforming
As mentioned initially, I prefer to filter some feeds.
Fortunately, sfeed supports this via sfeedrc(5)'s filter function, which is described as follows.
filter(name, url)
Filter sfeed(5) data from stdin and write it to stdout, its
arguments are:
name Feed name.
url URL of the feed.
Thus, this function is called on every feed, receiving all incoming TSV data on stdin and continuing with the TSV data sent back to stdout. This architecture allows stream-based filtering, even chaining multiple filters. However, since the output is in the domain of this function, it also allows rewriting of feeds, e.g., to enrich titles.
Helpful filters, at least for me, included dropping replies on mixed announcement and discussion mailing lists, so that only the original post shows up. When subscribing to meta-aggregators, it was useful to prefix the title with the origin, based on the URL. In the past, I subscribed to a YouTube channel where I was only interested in certain videos, allowing me to drop the others based on their titles.
Since these are just some ideas of what is possible, maybe posting a stripped-down version of my current filter will make it more concrete.
# filter(name, url)
filter() {
    case "$1" in
    "freifunk community news")
        # Prefix title with domain
        awk '
            BEGIN { FS=OFS="\t" }
            match($3, /:\/\/[a-z0-9.-]+\//) {
                $2 = "[" substr($3, RSTART+3, RLENGTH-4) "] " $2
                print $0
            }
        ' ;;
    "oss-sec")
        # Drop replies; first posts only.
        awk -F '\t' '$2 !~ /^Re: /' ;;
    *)
        cat ;;
    esac |\
    # Use URL as title if title is missing.
    awk 'BEGIN { FS=OFS="\t" } { sub(/^$/, $3, $2); print $0 }' |\
    # Prefix YouTube links with [VIDEO].
    awk 'BEGIN { FS=OFS="\t" } $3 ~ /https:\/\/www.youtube/ { $2 = "[VIDEO] " $2 } //'
}
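Such a filter can also be exercised by hand, outside of sfeed_update, by sourcing the configuration and piping an existing feed file through it. A hypothetical invocation, assuming an oss-sec feed has already been fetched and that sfeed_plain reads from stdin when given no file arguments:
_rss@openbsd:~> (. ~/.sfeed/sfeedrc; filter "oss-sec" < ~/.sfeed/feeds/oss-sec) | sfeed_plain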
But why?
Right now, this blog post clocks in at over 2k words. That is a lot of text to describe how to use an RSS reader. Why even bother?
As I said at the beginning, good software should not get in your way. Not only should it work, but it should work for you, without forcing you to bend to its design. This is not limited to RSS readers, of course.
However, in the RSS domain, sfeed has done the trick for me. So I just wanted to say “thank you” for this wonderful piece of software. I also wanted to showcase how to get started with it and how easy it is to extend it. If the normal RSS reader workflow does not work for you, I would encourage you to give sfeed a shot.
Friday, 18 October 2024
KDE Gear 24.12 release schedule
This is the release schedule the release team agreed on
https://community.kde.org/Schedules/KDE_Gear_24.12_Schedule
Dependency freeze is in around 3 weeks (November 7) and feature freeze one week after that. Get your stuff ready!
Monday, 07 October 2024
Google Summer of Code Mentor Summit 2024
This weekend "The KDE Alberts"[1] attended Google Summer of Code Mentor Summit 2024 in Sunnyvale, California.
The Google Summer of Code Mentor Summit is an annual unconference that every project participating in Google Summer of Code 2024 is invited to attend. This year was the 20th anniversary celebration of the program!
I was too late to take a picture of the full cake!
We attended many sessions ranging from how to try to avoid falling into the "xz problem" to collecting donations or shaping the governance of open source projects.
We met lots of people who knew what KDE was and were happy to congratulate us on the job done, and also a few who did not know KDE and were happy to learn about what we do.
We also did a quick lightning talk about the GSOC projects KDE mentored this year and led two sessions: one centered around the problems some open source application developers are having publishing to the Google Play Store and another session about Desktop Linux together with our Gnome friends.
All in all a very productive unconference. We encourage KDE mentors to take the opportunity to attend the Google Summer of Code Mentor Summit next year, it's a great experience!
[1] Me and Albert Vaca; people were moderately amused that both of us have the same name, contribute to the same community and are from the same city.
Wednesday, 02 October 2024
SSH Hardening Ubuntu 24.04 LTS
Personal notes on hardening a new Ubuntu 24.04 LTS SSH daemon setup for incoming SSH traffic.
Port <12345>
PasswordAuthentication no
KbdInteractiveAuthentication no
UsePAM yes
X11Forwarding no
PrintMotd no
UseDNS no
KexAlgorithms sntrup761x25519-sha512@openssh.com,curve25519-sha256,curve25519-sha256@libssh.org,diffie-hellman-group-exchange-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group14-sha256
HostKeyAlgorithms ssh-ed25519-cert-v01@openssh.com,ecdsa-sha2-nistp256-cert-v01@openssh.com,ecdsa-sha2-nistp384-cert-v01@openssh.com,ecdsa-sha2-nistp521-cert-v01@openssh.com,sk-ssh-ed25519-cert-v01@openssh.com,sk-ecdsa-sha2-nistp256-cert-v01@openssh.com,rsa-sha2-512-cert-v01@openssh.com,rsa-sha2-256-cert-v01@openssh.com,ssh-ed25519,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,sk-ssh-ed25519@openssh.com,sk-ecdsa-sha2-nistp256@openssh.com,rsa-sha2-512,rsa-sha2-256
MACs umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512
AcceptEnv LANG LC_*
AllowUsers <username>
Subsystem sftp /usr/lib/openssh/sftp-server
testing with https://sshcheck.com/
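In addition, the configuration can be validated and its effective values inspected locally before reloading the daemon; for example, on a stock Ubuntu install:
# Syntax-check the config, dump the effective values, then reload.
sudo sshd -t
sudo sshd -T | grep -iE '^(kexalgorithms|hostkeyalgorithms|macs|passwordauthentication)'
sudo systemctl reload ssh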