Building a #Gen AI Product for Model Documentation

In July, we launched AI-Assisted Documentation (AIAD) in Verta, a new feature that uses generative AI to help data scientists write high-quality model documentation in minutes rather than hours. We’re already getting some great feedback on this feature (and if you’ve ever felt like model documentation should be less tedious, we’d encourage you to try it out!)

In the spirit of sharing with others who are navigating the new, fast-paced, and sometimes overwhelming world of building with Gen AI, we wanted to reflect on the top lessons we learned from developing this feature.

Lesson 1: Prototyping is easy; Productizing is hard

We hacked up the first version of AI-Assisted Documentation in a few hours over a weekend. Its initial promise was AWESOME — give it a bit of input and have it produce beautiful and coherent documentation. However, productizing it took many many weeks and months, a process that included refining the model and prompts, collecting training and validation data, building the right APIs, building the right front-end experience for users to correctly utilize the model, and then making it production-ready so it could handle large amounts of traffic.

We found that the biggest hurdle in productizing a Gen AI application, besides nailing the user experience, is ensuring that the AI model produces consistently high-quality outputs in a variety of input scenarios (more on this in Lesson 2).

10% of the work is prototyping something cool, 90% of it is actually making it work each time and predictably.

Lesson 2. Making results consistent and “fit-for-purpose” is hard

Our GenAI application assists data scientists in writing model documentation. So “fit-for-purpose” in our case means that any text produced by a language model has to be acceptable to a DS, with characteristics like the appropriate tone, word choice, reading comprehension level, and brevity. Depending on the prompts we tried, GPT’s outputs ranged from “a high-schooler wrote this answer” (complete with filler words) to this description was “constructed by a proficient data scientist.”

Ultimately, achieving consistent, high-quality results meant many rounds of prompt engineering iterations, in addition to developing a training dataset of good model documentation examples.

Lesson 3. Hallucinations instantly break credibility

For our use case, data scientists would answer a series of questions and GPT would expand and rewrite their answers to create a complete set of model documentation. In text rewriting, hallucinations may not seem like such a big deal. How much can the model really veer off course?

We discovered that while, complete fabrications were unusual, the model would frequently use out-of-context information that instantly breaks credibility. For instance, in the example below, expanding demographics to “age, education, marital status etc” may be accurate in some cases — but in others, using marital status may be prohibited by law, making the expansion unacceptable.

We were able to minimize hallucinations using better prompting as well as training data sets, but with LLMs, it’s not possible to reduce the probability of hallucinations to zero. To further mitigate risk, we chose to have the user review and consent before using any generated text.

Lesson 4: Validation of results is hard

As noted widely, validation of results from a text generation model like GPT is extremely hard. When you're generating text to be fit-for-purpose, not only does the perplexity score matter — but the metrics that matter a lot more are things like tone, brevity, word choice, etc. (as discussed in Lesson 1). Many of these assessments today need human review and labeling. Therefore validation — and handling cases where validators disagree — is still extremely challenging and subjective.

PS: We are working on some very interesting work in the space and would love to bounce ideas with folks who have struggled with validation.

Lesson 5: UX is (almost) everything

Using generative AI in a workflow changes the nature of the workflow itself. In a typical software product, users can expect machine-generated outputs to always make sense. On the other hand, with generative AI, these outputs can be low-quality at times, so the UX needs to support an iterative, trial-and-error workflow. UX patterns for these kinds of workflows aren’t common (yet). Since users will be learning how to interact with the AI assistance for the first time, we also need many tutorials built into the product that help with discovery and onboarding.

We’d love to hear what you think of AI-Assisted Documentation in Verta. To check it out, go to app.verta.ai and look for the Documentation tab for any model. You can read more about the feature and our goal to reduce the time it takes to write model documentation by over 10x on our blog.

Subscribe To Our Blog

Get the latest from Verta delivered directly to you email.

5 Lessons From Building a #GenAI Product for Model Documentation