I spent the last 48 hours making GPT-4 read the entire validator codebase and write documentation, so no one else has to. Introducing TolyGPT.com: an AI-powered chatbot trained on nothing but code that can answer deep technical questions.

How it works 👇

But first, a huge shoutout to everyone whose hard work made prototyping this thing a breeze. Without further ado...

Devs like to write code, not documentation. Tribal knowledge is lost when devs move on to other projects, leaving future devs to sort through mountains of code and figure out not just how it works, but why it works that way. This is all about to change.

GPT-4's ability to write code is stunning. It seems to understand something fundamental about writing software that previous models just didn't, and that grasp of the principles behind a complex system's design carries over into documenting existing codebases in a truly impressive way. With the enlarged context window, it's now feasible to feed GPT-4 entire files of code and ask it to write documentation about how the code works.

Taking this as a starting point, the process looks something like this (see the first sketch below):

1. Download the repo.
2. Do a depth-first traversal of the repo contents, ignoring things like package-lock and binary files.
3. Feed each file to GPT-4 and ask it to write documentation in markdown.
4. Save the output in a separate location as [outputRoot]/[inputFilepath][inputFilename].md.
5. For each folder, ask GPT-4 to write a summary of that folder, using the newly generated documentation for all of its files and the summaries from each of its subfolders as context. Write this to the filesystem as markdown.

Now we have a filesystem that matches the structure of the input repo, but every file in the tree is markdown documentation of the corresponding code file. From here, we:

1. Load the markdown documents into LangChain.
2. Embed all of the documents.
3. Store the embeddings in a vector database.

When a user sends a query (see the second sketch below):

1. Embed the query.
2. Find the k-nearest markdown documents.
3. Feed those documents to GPT-4 with a prompt asking it to answer the query based on them.

The craziest part of all this? GPT-4 actually wrote ~30% of the code.
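Here's a rough sketch of the crawl-and-document step, assuming the OpenAI Python client. The folder names, model id, suffix filter, and helper names are all illustrative, not the actual TolyGPT source:

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

IGNORED = {"package-lock.json", "Cargo.lock"}                 # skip lockfiles
TEXT_SUFFIXES = {".rs", ".toml", ".md", ".ts", ".js", ".sh"}  # crude "not binary" filter

def ask_gpt4(system: str, user: str) -> str:
    """One chat completion against a large-context GPT-4 model."""
    resp = client.chat.completions.create(
        model="gpt-4-32k",  # assumed model id; any large-context model works
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def document_folder(folder: Path, repo_root: Path, out_root: Path) -> str:
    """Depth-first: document every file, then summarize the folder using its
    file docs plus the summaries returned by its subfolders as context."""
    context = []
    for child in sorted(folder.iterdir()):
        if child.is_dir() and not child.name.startswith("."):
            context.append(document_folder(child, repo_root, out_root))
        elif child.is_file() and child.name not in IGNORED and child.suffix in TEXT_SUFFIXES:
            doc = ask_gpt4(
                "You write clear markdown documentation for source code.",
                f"Document how this file works:\n\n{child.read_text(errors='ignore')}",
            )
            # Mirror the repo layout: [outputRoot]/[inputFilepath][inputFilename].md
            out_path = out_root / (str(child.relative_to(repo_root)) + ".md")
            out_path.parent.mkdir(parents=True, exist_ok=True)
            out_path.write_text(doc)
            context.append(doc)
    # Folder summary from file docs + subfolder summaries (a real run would
    # truncate this context for very large folders).
    summary = ask_gpt4(
        "Summarize this part of a codebase in markdown.",
        "\n\n---\n\n".join(context) or "(empty folder)",
    )
    summary_path = out_root / folder.relative_to(repo_root) / "_summary.md"
    summary_path.parent.mkdir(parents=True, exist_ok=True)
    summary_path.write_text(summary)
    return summary

document_folder(Path("solana"), Path("solana"), Path("validator-docs"))
```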
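And a minimal sketch of the embed-and-retrieve flow. The real project loads the markdown into LangChain and keeps the embeddings in a vector store; to stay self-contained, this version uses plain OpenAI embeddings and a brute-force in-memory nearest-neighbor search, which has the same shape:

```python
from pathlib import Path
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

# 1. Load the generated markdown docs and embed them once, up front.
#    (Batched in one call for brevity; a real run would chunk this.)
docs = [(p, p.read_text()) for p in sorted(Path("validator-docs").rglob("*.md"))]
doc_vectors = embed([text for _, text in docs])

def answer(query: str, k: int = 4) -> str:
    # 2. Embed the query and find the k-nearest docs by cosine similarity.
    q = embed([query])[0]
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    context = "\n\n---\n\n".join(docs[i][1] for i in np.argsort(-sims)[:k])
    # 3. Ask GPT-4 to answer using only the retrieved documentation as context.
    resp = client.chat.completions.create(
        model="gpt-4-32k",  # assumed model id
        messages=[
            {"role": "system",
             "content": "Answer questions about the Solana validator using only the provided documentation."},
            {"role": "user", "content": f"Documentation:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content

print(answer("How does the validator process a transaction?"))
```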
The results are pretty good for 2 days of work, but there is certainly room for improvement. Some items that are top of mind:

1. TolyGPT will occasionally hallucinate answers. It is especially bad with links to external sources like GitHub. The base model also seems to know a bit about Solana already, and that sometimes creeps into answers. Fine-tuning the prompt can solve some of this.
2. Context selection is difficult in a codebase this large. For example, it will sometimes pull in details about the Solana SDK when asked about transaction processing, since the SDK files can look relevant depending on how the question is phrased. It may be worth breaking the documentation into subsystems to limit this.
3. Not all files fit into the 32k-token window. As of now, 23 of the roughly 1,100 files cannot be documented in their entirety, and some of those files are very important to how Solana works.

Final thoughts:

1. GPT-4 is super powerful, and we're going to see a ton of tools that supercharge the entire software development lifecycle. This is not 12 months away; for the people who can afford it, these tools are here now, and they're only getting better. Act accordingly.
2. The price of inference has to come down for this to go mainstream. I spent about $300 prototyping this project, and the final crawl cost about the same. The high cost of GPT-4 will push developers toward other, cheaper alternatives with similar performance. This is coming very soon.

If you have a large software project and you're interested in something like this for your codebase, fill out this form tally.so/r/nr502L and we'll be in touch this week. Or just DM me :)