<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[The Crow's Nest]]></title><description><![CDATA[Your Argument Is Invalid - My Programmer Is A Bird]]></description><link>https://blog.kinglycrow.com/</link><image><url>https://blog.kinglycrow.com/favicon.png</url><title>The Crow&apos;s Nest</title><link>https://blog.kinglycrow.com/</link></image><generator>Ghost 5.47</generator><lastBuildDate>Fri, 10 Apr 2026 20:38:17 GMT</lastBuildDate><atom:link href="https://blog.kinglycrow.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[No Skill. No Taste.]]></title><description><![CDATA[<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.kinglycrow.com/content/images/2026/02/Gemini_Generated_Image_fgdy61fgdy61fgdy.png" class="kg-image" alt="Post title, Gartner magic quadrant show casing skill and taste and joke HN post about a todo app" loading="lazy" width="2000" height="1091" srcset="https://blog.kinglycrow.com/content/images/size/w600/2026/02/Gemini_Generated_Image_fgdy61fgdy61fgdy.png 600w, https://blog.kinglycrow.com/content/images/size/w1000/2026/02/Gemini_Generated_Image_fgdy61fgdy61fgdy.png 1000w, https://blog.kinglycrow.com/content/images/size/w1600/2026/02/Gemini_Generated_Image_fgdy61fgdy61fgdy.png 1600w, https://blog.kinglycrow.com/content/images/2026/02/Gemini_Generated_Image_fgdy61fgdy61fgdy.png 2000w" sizes="(min-width: 720px) 720px"><figcaption>Ironic image generated with nano banana to showcase my point.&#xA0;</figcaption></figure><p><br>I was reading a <a href="https://news.ycombinator.com/item?id=47077036&amp;ref=blog.kinglycrow.com">thread on HN</a> and I started writing this super long comment and rewriting and editing and thought, hey, if I&apos;m doing this I clearly care enough about the state of Show HN and</p>]]></description><link>https://blog.kinglycrow.com/no-skill-no-taste/</link><guid isPermaLink="false">6997ee0e640d540073484140</guid><category><![CDATA[LLMs]]></category><category><![CDATA[Opinion]]></category><dc:creator><![CDATA[crow]]></dc:creator><pubDate>Fri, 20 Feb 2026 06:00:21 GMT</pubDate><media:content url="https://blog.kinglycrow.com/content/images/2026/02/Gemini_Generated_Image_fgdy61fgdy61fgdy.png" medium="image"/><content:encoded><![CDATA[<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.kinglycrow.com/content/images/2026/02/Gemini_Generated_Image_fgdy61fgdy61fgdy.png" class="kg-image" alt="No Skill. No Taste." loading="lazy" width="2000" height="1091" srcset="https://blog.kinglycrow.com/content/images/size/w600/2026/02/Gemini_Generated_Image_fgdy61fgdy61fgdy.png 600w, https://blog.kinglycrow.com/content/images/size/w1000/2026/02/Gemini_Generated_Image_fgdy61fgdy61fgdy.png 1000w, https://blog.kinglycrow.com/content/images/size/w1600/2026/02/Gemini_Generated_Image_fgdy61fgdy61fgdy.png 1600w, https://blog.kinglycrow.com/content/images/2026/02/Gemini_Generated_Image_fgdy61fgdy61fgdy.png 2000w" sizes="(min-width: 720px) 720px"><figcaption>Ironic image generated with nano banana to showcase my point.&#xA0;</figcaption></figure><img src="https://blog.kinglycrow.com/content/images/2026/02/Gemini_Generated_Image_fgdy61fgdy61fgdy.png" alt="No Skill. 
No Taste."><p><br>I was reading a <a href="https://news.ycombinator.com/item?id=47077036&amp;ref=blog.kinglycrow.com">thread on HN</a> and I started writing this super long comment and rewriting and editing and thought, hey, if I&apos;m doing this I clearly care enough about the state of Show HN and HN in general to write a post on it. I&apos;ve written code since I was 11. I&apos;ve worked on larger distributed systems, web apps, databases, search and more. I have many opinions on the transformation of our profession that is currently underway. Most of all, there is now an illusion of a lower barrier to entry. There is a magic quadrant made up of taste and skill. And too many people over estimate their taste[0] and their skill (or never care in the first place).<br><br>LLMs have people everywhere super excited they can finally build their dream applications! The only problem is, no one needs their dream application. We see it everyday now, someone posts some obvious vibe coded app which is poorly crafted and clearly derivative of an idea so thoroughly saturated it&apos;s literally leaking. This is the lowest part of the quadrant. No skill and no taste. The overall suffusion of this into the broader scene rightly has the more sensitive of us up in arms. It&apos;s noise, it&apos;s spam, it&apos;s a perversion of the years of skill we&apos;ve spent accruing. <br></p><p><br>The only problem there is you might have skill, but do you have taste? This problem itself isn&apos;t new. HN of all places has always been a matter of taste. Things people found interesting made it to the front page, things they did not languished. You could build the most finely abstracted todo app of all time and your app would be dead on arrival. However, if you built something that resonated with a large enough group of people it never mattered how well built the app was or how technically complex.</p><p> &#xA0;</p><p>I&apos;ve seen plenty of content on HN that could not have been more than a simple crud app that rocketed to the front page. What comes to mind immediately was a <a href="https://en.wikipedia.org/wiki/This_Website_Will_Self-destruct?ref=blog.kinglycrow.com">little app</a> that died if someone hadn&apos;t posted a message on it in 24 hours. Inherently simple, but quite popular. It was pure taste.</p><p> &#xA0;</p><p>Taste and skill are related, the more saturated something is the higher skill you need to cross the taste threshold to make people care. It&apos;s not that there will never be another interesting todo app, it&apos;s that it has to be so tasteful as to cross our maximal standards and pre-existing expectations of them.<br><br>LLMs have exposed this more thoroughly than any other time in tech so far. The sin isn&apos;t that someone uses an LLM to generate an application[1], vibe[2] or not. The sin is they lacked enough skill and enough taste to cross the actual threshold the rest of us need to see for the work to not be slop. <br><br></p><p>An obvious and recent example of this is OpenClaw. It is a bit of a software nightmare (sorry Peter, I know you&apos;re good), but it&apos;s highly tasteful even being pretty vibey. People ate it up immediately and because there was such an interest the lack of technical soundness and security was overlooked (or begrudgingly put up with)</p><p><br><br>The lack of taste only presents a problem now, because it&apos;s so much easier for people who thought they have more taste than they actually do to post every little idea they have. 
This is a real problem, and I think it will taper off because people will either learn proper etiquette or face disappointment. It&apos;s a massive educational period: a lot of people are catching up on norms the rest of us have had years to internalize.</p><p><br><br>It has the same stink crypto had, the promise that anyone can get rich. Most of them won&apos;t. This is the illusion of the lower barrier to entry: the barrier has always been taste, and LLMs do nothing to remove it. They amplify it.</p><p><br><br>Anyway, this is all to say: whether you have skill or not, you had better learn to be tasteful before you decide to slop all over everyone.</p><p><br><br>[0] Taste is totally dependent on the group you&apos;re building for. Discerning whether you have good taste, and for whom, is a process where you do have to put things out in front of people. But the bar has never been on the floor, not now and not in my years, so I assert there&apos;s a minimal universal taste we all share, and you should at least clear that before putting things out there.</p><p>[1] I&apos;ve been writing code for 20 years. I am super experienced in my domains, and I review the output, sand off the edges, make changes myself, etc. I vibe code almost 0% of the time.</p><p>[2] Vibing means you need to have exceptional taste to cross the bar. I don&apos;t care if you do it, but you need to own the outcome.<br><br></p>]]></content:encoded></item><item><title><![CDATA[Finetuning RWKV 14bn with QLORA in 4Bit]]></title><description><![CDATA[<p>It was surprisingly easy to get this working, and I think that&apos;s a good thing.</p><p>First I looked at existing LORA implementations of RWKV which I discovered from the very helpful <a href="https://discord.gg/bDSBUMeFpc?ref=blog.kinglycrow.com">RWKV Discord</a>. The link I found in the discord landed me at <a href="https://mysymphony.jp.net/a/rwkv-character?ref=blog.kinglycrow.com">&quot;How to Train Your</a></p>]]></description><link>https://blog.kinglycrow.com/rwkv-qlora-4bit/</link><guid isPermaLink="false">64710e15e8572f01717bf9a7</guid><category><![CDATA[LLMs]]></category><category><![CDATA[NLP]]></category><category><![CDATA[RWKV]]></category><dc:creator><![CDATA[crow]]></dc:creator><pubDate>Fri, 26 May 2023 23:41:49 GMT</pubDate><content:encoded><![CDATA[<p>It was surprisingly easy to get this working, and I think that&apos;s a good thing.</p><p>First I looked at existing LORA implementations of RWKV, which I discovered through the very helpful <a href="https://discord.gg/bDSBUMeFpc?ref=blog.kinglycrow.com">RWKV Discord</a>. The link I found in the Discord landed me at <a href="https://mysymphony.jp.net/a/rwkv-character?ref=blog.kinglycrow.com">&quot;How to Train Your Raven&quot;</a>; shout out to the author, Nana. From that blog post I found the general LORA <a href="https://github.com/Blealtan/RWKV-LM-LoRA/blob/main/RWKV-v4neo/src/model.py?ref=blog.kinglycrow.com">implementation from Blealtan</a>.</p><p>You can see, starting on line 153 and going up to line 192, the modules they&apos;ve chosen to augment with LORA: the linear attention and feed forward portions of the network. More specifically, the LORA layers are applied to the linear projections for receptance, key and value.</p><pre><code class="language-python3">class LoraLinear(nn.Module):

    def __init__(self, in_features: int, out_features: int, bias: bool):
        super().__init__()

        self.weight = nn.Parameter(torch.empty((out_features, in_features)))
        assert bias == False, &quot;Biased LoraLinear not supported&quot;

        r, alpha, dropout = LORA_CONFIG[&quot;r&quot;], LORA_CONFIG[
            &quot;alpha&quot;], LORA_CONFIG[&quot;dropout&quot;]
        self.lora_A = nn.Parameter(torch.empty(r, in_features))
        self.lora_B = nn.Parameter(torch.empty(out_features, r))
        self.lora_dropout = nn.Dropout(dropout)
        self.scaling = alpha / r

        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)

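    # (Note added for clarity; not in the original file.) The forward pass below
    # returns the base projection W @ x plus the low-rank update
    # scaling * (lora_B @ lora_A @ dropout(x)); since lora_B is initialised to
    # zeros, the update starts as a no-op and only takes effect as training moves it.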
    def forward(self, x):
        return (
            F.linear(x, self.weight) + self.scaling *
            F.linear(F.linear(self.lora_dropout(x), self.lora_A), self.lora_B))


@functools.wraps(LoraLinear)
def make_linear_att(*args, **kwargs):
    if &quot;att&quot; in LORA_CONFIG[&quot;parts&quot;] and LORA_CONFIG[&quot;r&quot;] &gt; 0:
        return LoraLinear(*args, **kwargs)
    else:
        return nn.Linear(*args, **kwargs)


@functools.wraps(LoraLinear)
def make_linear_ffn(*args, **kwargs):
    if &quot;ffn&quot; in LORA_CONFIG[&quot;parts&quot;] and LORA_CONFIG[&quot;r&quot;] &gt; 0:
        return LoraLinear(*args, **kwargs)
    else:
        return nn.Linear(*args, **kwargs)
</code></pre><p>After looking at the RWKV paper and the model architecture, combined with this LORA implementation I figured that these were the correct modules to target.</p><p>Recently I have implemented my own fine tuning library on top of Transformers, Accelerate and PEFT which is <a href="https://github.com/iantbutler01/ditty?ref=blog.kinglycrow.com">iantbutler01/ditty</a>, and the rest of this was simply integrating QLORA into Ditty under an experimental flag.</p><h2 id="implementing-qlora-in-your-transformers-based-pipeline-using-peft">Implementing QLORA in your Transformers based Pipeline using PEFT</h2><p>You can install the development versions of Transformers, PEFT and Accelerate with the below snippet. I recommend doing this in a separate <a href="https://virtualenv.pypa.io/en/latest/?ref=blog.kinglycrow.com">virtualenv</a> so that potential breaking changes do not interfere with the other work that you may be doing!</p><figure class="kg-card kg-code-card"><pre><code>pip install -U git+https://github.com/huggingface/transformers.git 
pip install -U git+https://github.com/huggingface/peft.git
pip install -U git+https://github.com/huggingface/accelerate.git
pip install --upgrade bitsandbytes</code></pre><figcaption>Development versions are needed because this isn&apos;t stable yet!</figcaption></figure><p>After this you can follow the excellent <a href="https://huggingface.co/blog/4bit-transformers-bitsandbytes?ref=blog.kinglycrow.com">blog post</a> introducing the method from Huggingface; I&apos;ll also show my implementation of it in Ditty, with the few changes needed to support RWKV.</p><p>After checking some flags set in my library, I enable 4bit and pass the necessary config to Bitsandbytes per the HF post,</p><pre><code class="language-python3">        if self.l8bit and self.l4bit:
            raise ValueError(&quot;Cannot set both l8bit and l4bit to True.&quot;)

        if self.l4bit and experimental:
            self.bnb_config = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_use_double_quant=True,
                bnb_4bit_quant_type=&quot;nf4&quot;,
                bnb_4bit_compute_dtype=torch.bfloat16,
            )
        elif self.l4bit and not experimental:
            raise ValueError(&quot;To use 4bit, `experimental` must be set to True.&quot;)
        elif self.l8bit:
            self.bnb_config = BitsAndBytesConfig(
                load_in_8bit=l8bit, llm_int8_enable_fp32_cpu_offload=fp32_cpu_offload
            )</code></pre><pre><code class="language-python3"># I pass the BNB config directly to the model loading step
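# device_map=&quot;auto&quot; lets Accelerate place the quantized weights across the available
# GPUs (and CPU if needed), and quantization_config carries the BitsAndBytesConfig built
# above, so the 14B checkpoint is loaded directly in NF4 4-bit with bfloat16 compute.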

self.model = AutoModelForCausalLM.from_pretrained(
            self.model_name_or_path,
            device_map=&quot;auto&quot;,
            quantization_config=self.bnb_config
        )</code></pre><p>To enable targeting of RWKV modules I first printed out the model to see how they are named in the Transformers version just in case they&apos;re different.</p><pre><code>RwkvForCausalLM(
  (rwkv): RwkvModel(
    (embeddings): Embedding(50277, 5120)
    (blocks): ModuleList(
      (0): RwkvBlock(
        (pre_ln): LayerNorm((5120,), eps=1e-05, elementwise_affine=True)
        (ln1): LayerNorm((5120,), eps=1e-05, elementwise_affine=True)
        (ln2): LayerNorm((5120,), eps=1e-05, elementwise_affine=True)
        (attention): RwkvSelfAttention(
          (time_shift): ZeroPad2d((0, 0, 1, -1))
          (key): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (value): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (receptance): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (output): Linear4bit(in_features=5120, out_features=5120, bias=False)
        )
        (feed_forward): RwkvFeedForward(
          (time_shift): ZeroPad2d((0, 0, 1, -1))
          (key): Linear4bit(in_features=5120, out_features=20480, bias=False)
          (receptance): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (value): Linear4bit(in_features=20480, out_features=5120, bias=False)
        )
      )
      (1-39): 39 x RwkvBlock(
        (ln1): LayerNorm((5120,), eps=1e-05, elementwise_affine=True)
        (ln2): LayerNorm((5120,), eps=1e-05, elementwise_affine=True)
        (attention): RwkvSelfAttention(
          (time_shift): ZeroPad2d((0, 0, 1, -1))
          (key): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (value): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (receptance): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (output): Linear4bit(in_features=5120, out_features=5120, bias=False)
        )
        (feed_forward): RwkvFeedForward(
          (time_shift): ZeroPad2d((0, 0, 1, -1))
          (key): Linear4bit(in_features=5120, out_features=20480, bias=False)
          (receptance): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (value): Linear4bit(in_features=20480, out_features=5120, bias=False)
        )
      )
    )
    (ln_out): LayerNorm((5120,), eps=1e-05, elementwise_affine=True)
  )
  (head): Linear(in_features=5120, out_features=50277, bias=False)
)</code></pre><p>So in this case we&apos;re looking at &apos;key&apos;, &apos;value&apos; and &apos;receptance&apos;. Then I added a few small changes to my existing pipeline in Ditty.</p><pre><code class="language-python3">        if &quot;gpt-neox&quot; in self.model_name_or_path:
            target_modules = [&quot;query_key_value&quot;, &quot;xxx&quot;]

        if &quot;rwkv&quot; in self.model_name_or_path:
            target_modules = [&quot;key&quot;, &quot;value&quot;, &quot;receptance&quot;]
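            # PEFT matches these strings against module names, so every attention and
            # feed_forward key/value/receptance Linear4bit layer from the printout above
            # gets a LoRA adapter attached.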

        peft_config = LoraConfig(
            task_type=TaskType.CAUSAL_LM,
            target_modules=target_modules,
            inference_mode=False,
            r=8,
            lora_alpha=16,
            lora_dropout=0.05,
            bias=&quot;none&quot;
        )


        if self.l4bit:
            from peft import prepare_model_for_kbit_training
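            # prepare_model_for_kbit_training freezes the quantized base weights, upcasts
            # the remaining fp16/bf16 parameters (e.g. the layer norms) to fp32 for
            # stability, and hooks up gradient checkpointing when requested.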
            self.model = prepare_model_for_kbit_training(
                self.model, use_gradient_checkpointing=self.gradient_checkpointing
            )
        elif self.l8bit:
            self.model = prepare_model_for_int8_training(
                self.model, use_gradient_checkpointing=self.gradient_checkpointing
            )</code></pre><p>It&apos;s worth calling out that I disabled gradient checkpointing; the first time I tried to run with it enabled it errored out, so I left it off. I think it&apos;s worth investigating how it can be enabled, however.</p><p>That&apos;s really all there is to it. I also have a few things like bfloat16 enabled for the training pipeline; you can see my configuration using a subclassed Ditty pipeline here:</p><pre><code>if __name__ == &quot;__main__&quot;:
    pipeline = RWKVPipeline(
        dataset_name=&quot;databricks/databricks-dolly-15k&quot;,
        model_name_or_path=&quot;RWKV/rwkv-raven-14b&quot;,
        gradient_checkpointing=False,
        block_size=512,
        grad_accum=32,
        batch_size=1,
        l4bit=True,
        l8bit=False,
        experimental=True,
        fp16=True,
        use_bfloat16=True
        
    )

    pipeline.run()</code></pre><p>The link to the implementation can be found <a href="https://github.com/iantbutler01/rwkv-raven-qlora-4bit-instruct/blob/main/main.py?ref=blog.kinglycrow.com">here</a>.</p><p>At the time of writing this the fine tuning process has finished but I have not yet evaluated the model for quality, I will update when I do!</p>]]></content:encoded></item><item><title><![CDATA[Trying to steer LLM output towards correctness using MIPS]]></title><description><![CDATA[What if you could find the candidate token that when combined with the previous outputs would produce the smallest cosine distance to the input prompt for each step of the generation, with the idea that this would align it more closely with the input prompt and prevent the model from going off task.]]></description><link>https://blog.kinglycrow.com/steering-llm-output-towards-correctness/</link><guid isPermaLink="false">645d96f3e8572f01717bf5af</guid><category><![CDATA[NLP]]></category><category><![CDATA[LLMs]]></category><dc:creator><![CDATA[crow]]></dc:creator><pubDate>Wed, 17 May 2023 02:37:49 GMT</pubDate><content:encoded><![CDATA[<h2 id="foreword">Foreword</h2><p>This is what has finally convinced me to launch a blog. I do a lot of programming, but I don&apos;t necessarily believe that the web needs yet another programmer talking about their development experiences or that my experiences generally merit being shared, but I&apos;ve been following the NLP space since 2016 and with how exciting of a time it is, I can&apos;t help but throw my hat in the ring. I hope you, dear reader, find this interesting if not useful. -KC</p><h2 id="what-if">What if...?</h2><p>Recently I began work on what people are referring to as a &quot;semi autonomous agent&quot; framework, <a href="https://github.com/iantbutler01/asimov?ref=blog.kinglycrow.com">Asimov</a>. I decided that though AutoGPT is cool, the ability to mix and match smaller open source models with proprietary models like GPT4 may yield better results. I also have different opinions on how these agent loops should be structured and how tasks can be achieved. While working on this I began to experiment with <a href="https://huggingface.co/databricks/dolly-v2-12b?ref=blog.kinglycrow.com">Dolly V2</a> a 12BN parameter instruct tuned model released by Databricks. Frankly, I found the small model to be quite capable but the moment you reached a little outside of the zone of tasks it was trained on it fell apart pretty quickly. </p><p>At the same time I began to look for ways to stabilize the generation of JSON from these models. While doing so I stumbled on <a href="Structural Alignment: Modifying Transformers (like GPT) to Follow a JSON Schema">Structural Alignment: Modifying Transformers (like GPT) to Follow a JSON Schema, or Clownfish</a> from Ben Newhouse. Ben achieved the ability to consistently output valid JSON by taking the logits (scores for a set of potential tokens before being converted to a proper probability) produced by an LLM and testing them against a streaming JSON parser for a valid token matching the current part of the JSON schema being asked for and then would pin all other logits to -infinity, very harshly steering the model away from other paths.</p><p>This sparked a thought in me. What if you could also do so with the prompt fed to the model? 
What if you could find the candidate token that, when combined with the previous outputs, would produce the smallest cosine distance to the input prompt at each step of the generation, and select that candidate, with the idea that this would align the output more closely with the input prompt and prevent the model from going off task? &#xA0;This is basically a form of MIPS, or maximum inner product search, but for choosing the next LLM output token instead of a document.</p><h2 id="first-try">First try!</h2><p>Given the input prompt and the current set of tokens generated from the model so far:</p><ul><li>Produce a document embedding for the prompt using the model&apos;s input embedding layer, by taking all token embeddings in the input prompt, summing them, and then dividing by the number of tokens in the prompt.</li><li>Produce embeddings for all of the tokens created so far. </li><li>Take the logits for the next set of potential tokens and iterate through them.</li><li>During the iteration, produce an embedding of the current iteration&apos;s candidate token.</li><li>Sum the embeddings of all the tokens produced thus far and the embedding of the current candidate token.</li><li>Divide that summed vector by the length of all previous tokens, plus one for the candidate token, to produce a naive document embedding.</li><li>Take the cosine distance of the document embedding and the prompt embedding.</li><li>Take the candidate token with the smallest distance, and set the logits of all the others to -infinity using a mask.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.kinglycrow.com/content/images/2023/05/image.png" class="kg-image" alt loading="lazy" width="800" height="425" srcset="https://blog.kinglycrow.com/content/images/size/w600/2023/05/image.png 600w, https://blog.kinglycrow.com/content/images/2023/05/image.png 800w" sizes="(min-width: 720px) 720px"><figcaption>The original implementation</figcaption></figure><h3 id="first-problem-first-solution">First problem, first solution.</h3><p>Great, this was going to be so cool and really improve model output! Except... I forgot that each token generated by the model is actually a subword; LLMs generally use a byte pair encoding (BPE) and produce output at a subword level. Think &apos;a&apos; &apos;pp&apos; &apos;le&apos; and not &apos;apple&apos;. It&apos;s a little more complicated than that with respect to BPE, but that gets the general point across. What this means in practice is that even if a token minimizes the distance to the prompt it may not, when joined with other tokens, produce a coherent English word. And in fact, that is exactly what happened.</p><pre><code>Youmedscimonit unmist Syrac Gmb DeborPtrFromString&#xFFFD; enthus&#xFFFD;
</code></pre><p>That was the first set of tokens generated by the implementation above. You&apos;ll notice &apos;PtrFromString&apos;; this is what gave me the hint that I was in fact seeing an issue produced by the subwords, because it is very likely a super rare token and not something broken down into component subwords by the BPE scheme. You can learn more about BPE, and why I would think that, from this <a href="https://huggingface.co/docs/transformers/tokenizer_summary?ref=blog.kinglycrow.com#bytepair-encoding-bpe">concise overview of BPE by Huggingface</a>. Back to the drawing board. </p><p>Maybe I could maintain some type of state, some combination of ngrams, and test them to see if English was being produced, or maybe implement some kind of backtracking to go through other generations if it wasn&apos;t. Maybe just check a dictionary. Ultimately, all of those ideas are either more complex than I wanted to deal with or very lossy, like the dictionary idea, because it would be highly inflexible, not unlike the problem BPE had sought to correct for model generation. <br><br>I realized I was kind of being dumb. Logits are scores, and they represent the model&apos;s confidence that, under some sampling scheme, a token would be produced alongside the other tokens in the output. They naturally encode against gibberish. I just wanted to influence that confidence, not entirely zero it out or pin it to -infinity. I should just subtract the distance from each logit instead. Now the logits still encoded against gibberish while carrying a bias towards the input prompt (a minimal sketch of what this looks like as a logits processor follows the example prompt below). Would it work? Initially I thought so, and the prompt below would seem to confirm it, but I&apos;ve run into some interesting benchmark results that I discuss later on in the post.<br><br>For the prompt:</p><!--kg-card-begin: markdown--><pre><code>Please write me a task planning application using golang for the API and nextjs for the frontend. Assume that everything you need to build the application is already installed.
</code></pre>
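<p>To make that concrete, here is a minimal sketch of what the subtract-the-distance version could look like as a Huggingface <code>LogitsProcessor</code>. The class and variable names are illustrative rather than the actual API of my implementation (the real code lives in the Asimov repo linked at the end of the post), and it assumes a batch size of one:</p><pre><code class="language-python3"># Illustrative sketch only; names here are hypothetical, the real code is in the Asimov repo.
import torch.nn.functional as F
from transformers import LogitsProcessor


class QueryAlignmentProcessor(LogitsProcessor):
    def __init__(self, embeddings, prompt_ids):
        # embeddings: the model input embedding layer, e.g. model.get_input_embeddings()
        # prompt_ids: 1D tensor holding the token ids of the prompt
        self.emb = embeddings.weight.detach()                 # [vocab, dim]
        self.prompt_len = prompt_ids.shape[-1]
        # Naive document embedding of the prompt: mean of its token embeddings.
        self.prompt_doc = self.emb[prompt_ids].mean(dim=0)    # [dim]

    def __call__(self, input_ids, scores):
        # Everything generated after the prompt so far (assumes batch size 1).
        generated = input_ids[0, self.prompt_len:]
        prev_sum = self.emb[generated].sum(dim=0)             # [dim]
        n = generated.shape[0]
        # Candidate document embeddings for the whole vocabulary in one shot:
        # (sum of generated embeddings + candidate embedding) / (n + 1).
        cand_docs = (prev_sum.unsqueeze(0) + self.emb) / (n + 1)   # [vocab, dim]
        dist = 1.0 - F.cosine_similarity(cand_docs, self.prompt_doc.unsqueeze(0), dim=-1)
        # Soft steering: subtract the distance from each logit instead of masking
        # every other candidate to -infinity.
        return scores - dist.unsqueeze(0)</code></pre><p>A processor like this plugs into <code>model.generate(..., logits_processor=LogitsProcessorList([...]))</code>, which is the hook Transformers exposes for adjusting scores at every decoding step.</p>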
<!--kg-card-end: markdown--><p>Using the <a href="https://huggingface.co/databricks/dolly-v2-12b?ref=blog.kinglycrow.com">Dolly V2 12bn</a> model, this is the first output the model generated without applying the steering:</p><!--kg-card-begin: markdown--><pre><code>step 1: Fetch data from google sheets
 step 2: Parse the data from google sheets into a database structure
 step 3: Push the data into Google cloud storage

 google sheets data: https://docs.google.com/spreadsheets/d/1ZK4Z0uB5fC7_XNGo8cDRz7V5L9B_N9SJvDdE9KfKzsKJ0JWG3uvIjzOJLZNQM8EoQg1HmNyQ4/edit?usp=sharing

 gcloud command to push the data to google cloud storage:
 gcloud storage blob write [YOURBUCKETNAME]/[FILENAME] [URL] -- application/octet-stream
 gcloud storage -v upload [YOURBUCKETNAME]/[FILENAME]

 nextjs frontend code:
 // fetch data from google cloud storage
 const { storage } = require(&quot;@google-cloud/storage-backend&quot;);
 // parse the data from google cloud storage into a database structure
 async function getDataFromStorage() {
   const response = await storage.bucket().file
</code></pre>
<!--kg-card-end: markdown--><p>And this is the first output it generated with the newly improved steering:</p><!--kg-card-begin: markdown--><pre><code>You will need to use a postgresql database for this application but you do not need to create a database. The code for the application can be found here:
app.get(&quot;/&quot;, function (req, res) {
res.render(&quot;index.njk&quot;);
});

// TODO: Add task planning application functionality

/*
This is a start but you will need to complete the following:

Add functionality to allow users to create, edit, and delete tasks
Add functionality to allow users to assign tasks to users
Add functionality for users to view all their tasks
Add pagination to display tasks in the API
Add task statuses (to-do, done, in-progress)
Add due dates to tasks
Add notifications to tasks
You will also need to complete the following:

Handle errors in your application
Configure your database
After completing this exercise, you can delete the TODO comments to finish your application.

https://exercism.com/tracks/haskell/exercises/task-planning-application

*/

Next.js is a simple, fast and flexible framework
</code></pre>
<!--kg-card-end: markdown--><p>In the words of my friend whom I was incessantly DMing as I was working on this: &quot;Night and Day&quot;. However, as stated above, when considering the benchmarking I believe this was a fluke.</p><h2 id="second-problem-second-solution">Second problem, second solution</h2><p>The original implementation was pretty slow. We&apos;re talking a minute or more for 100 tokens. If you&apos;ve made it this far and have some experience, you&apos;ll have noticed a big loop and some pretty inefficient math that barely uses the benefits of torch. To fix this problem, we can vectorize everything! As it turns out, all the calculations, from the document embeddings to the distances of the candidates to the prompt, can be turned into stacked matrices and computed in one go instead of iterating. Some of the calculations can also be run once and cached instead of being recomputed every iteration.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.kinglycrow.com/content/images/2023/05/Screen-Shot-2023-05-09-at-12.18.26-AM.png" class="kg-image" alt loading="lazy" width="2000" height="1170" srcset="https://blog.kinglycrow.com/content/images/size/w600/2023/05/Screen-Shot-2023-05-09-at-12.18.26-AM.png 600w, https://blog.kinglycrow.com/content/images/size/w1000/2023/05/Screen-Shot-2023-05-09-at-12.18.26-AM.png 1000w, https://blog.kinglycrow.com/content/images/size/w1600/2023/05/Screen-Shot-2023-05-09-at-12.18.26-AM.png 1600w, https://blog.kinglycrow.com/content/images/size/w2400/2023/05/Screen-Shot-2023-05-09-at-12.18.26-AM.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Everything is a matrix!</figcaption></figure><p>This brought everything into the realm of usability; I would even call it fast. A lot of this was low hanging fruit, but it is still always enjoyable to take a piece of code that works, but works slowly, and crank out a several-hundred-percent speedup. I won&apos;t go through the exercise of benchmarking exactly how much of an increase it was, but it&apos;s now more than usable. </p><h2 id="my-concerns-and-lukewarm-results">My concerns and lukewarm results</h2><p>I haven&apos;t been able to produce numbers that show the success I was seeing above. I had run this many times, both with and without QueryAlignment, and subjectively the QueryAlignment output appeared better to me. Naturally I wanted to put some numbers to this.</p><p>It turns out finding benchmarks for generating long task output is somewhat challenging, so I decided to try this against <a href="https://github.com/openai/human-eval?ref=blog.kinglycrow.com">HumanEval</a>, which is a set of code generation tasks. After trying a few different combinations of tests, and making edits here and there as I discovered issues with my implementation, I managed to perform about 5% worse than the stock Dolly V2 12Bn model, which is honestly a pretty surprising result.</p><p>I should note that of the seven task categories Dolly V2 12BN was trained for, code generation was not one of them, even though it has clearly been trained on code. In general it only managed to score about 3.5% without query alignment and 3% with. Even with that, having seen many subjectively good generations somewhat related to codegen while using this alignment method, I had expected at least a marginal improvement.</p><h2 id="conclusion">Conclusion</h2><p>It was pretty interesting to see such a result from this even though it wasn&apos;t a smashing success. 
In fact, I really think there is a lot of room for both research and immediately practical uses in augmenting the logits produced by these LLMs. In a lot of cases the correct sequence exists within the search space and just requires a nudge in the right direction to produce it, much like with Ben Newhouse&apos;s Clownfish JSON parser. Methods like this seem pretty underexplored, at least for language models. Models in other spaces, like Stable Diffusion, have things like ControlNet, which is a more advanced form of shaping a model&apos;s outputs; we don&apos;t quite have anything like that for LLMs, at least nothing as mainstream or publicly released. I did see that Huggingface has a sampling solution that makes use of a smaller LLM to aid a larger one and keep it on task, and it wouldn&apos;t surprise me if OpenAI has something even more advanced than that.</p><p>If anyone has any ideas for better ways to test the effectiveness of this method, or improvements I could make that might shake up the lukewarm results, I&apos;d be happy to hear them! All of this code is available in this repo, and my generations for the HumanEval benchmark are there as well under benchmarks: <a href="https://github.com/iantbutler01/asimov?ref=blog.kinglycrow.com">https://github.com/iantbutler01/asimov</a></p><h3 id="future-work-other-problems">Future work, other problems</h3><ul><li>Could you apply this to focus a model&apos;s outputs to keep it more factual relative to some text that is not the prompt itself?</li><li>Is there a more effective way to use this to score the logits than just subtracting the distance from the score or multiplying the distances with the score?</li><li>Occasionally the model will just output the prompt itself.</li><li>What other processors could we implement to improve the quality of smaller models that can be run locally?</li><li>This solution uses pretty naive embedding schemes to create document embeddings; maybe use something like SBert to produce better representations.</li></ul>]]></content:encoded></item></channel></rss>