Building Castory: A Journey Through Market Research and Technical Implementation

As a final-year Information Systems student at the National University of Singapore, I had the unique opportunity to create an AI-powered audiobook platform for my graduation thesis.

Castory is designed to make turning text into audiobooks faster, cheaper, and more accessible, especially for underrepresented languages and niche genres. In this post, I’ll walk you through the technical journey of building Castory: the tech stack, the TTS engine I chose, the challenges I ran into, and the results so far.

Understanding the Problem and Market Research

Before diving into code, I first had to understand the problem and whether a solution like Castory was needed. Traditional audiobook production is expensive and time-consuming. It typically involves professional voice actors recording in a studio, which limits the selection of audiobooks, especially in niche genres and less widely spoken languages. Market research, consisting of interviews with potential users and an analysis of existing competitors, confirmed that there was real demand for a more affordable and scalable solution.

People wanted a platform that could quickly convert books into audiobooks with diverse voice options and language support. Based on these findings, I set out to build Castory with the goal of making audiobook production faster, more cost-effective, and easily customizable for users.

The Tech Stack: Building Castory’s Core

When choosing the technology stack, I wanted Castory built on solid, scalable, and user-friendly technologies. After considering different options, I opted for React and TailwindCSS on the frontend, Node.js and Express on the backend, and AWS for cloud infrastructure. The most crucial choice, however, was the TTS engine. After experimenting with several options, I decided to integrate Qwen3 TTS, a text-to-speech model that offers natural-sounding voices and supports multiple languages.

This decision was key to providing users with a high-quality audiobook experience: Qwen3 delivers voice variety, prosody, and tone modulation that make the generated audio sound more human and less robotic. With Qwen3 TTS, Castory can generate realistic audiobooks across a wide range of genres, from fiction to self-help, and in several languages, including English and Mandarin.

Frontend Development: A Seamless User Interface

Building the frontend in React let me create a dynamic, responsive interface that is easy to use yet highly customizable. Users can upload text in formats such as .txt, .epub, and .pdf, select a voice from the available options, adjust the reading speed, and listen to a preview before finalizing the audiobook. I used TailwindCSS to design the interface quickly, with a focus on minimalism and usability.

Backend Development: Managing Requests and Processing Audio

On the backend, Node.js with Express powers the platform’s core functionality. It handles user requests, processes uploaded text files, and coordinates the conversion through the Qwen3 engine. Once the text has been converted to audio, the result is stored securely on AWS S3, from which users can download or stream their audiobooks on demand.
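
To picture what "processing text and coordinating the conversion" can involve, here is a minimal sketch of one common preparatory step: splitting a long text into sentence-aligned chunks small enough to send to a TTS engine one request at a time. This is an illustration under stated assumptions, not Castory's actual code. The function name chunkText and the maxChars limit are hypothetical, and the post does not say whether or how Castory chunks its input.

```typescript
// Hypothetical helper (not from the post): split long text into chunks of at
// most maxChars characters, breaking at sentence boundaries where possible.
function chunkText(text: string, maxChars: number): string[] {
  if (maxChars <= 0) {
    throw new Error("maxChars must be positive");
  }
  // A "sentence" here is text ending in ., ! or ? (plus trailing whitespace),
  // or whatever unterminated text remains at the end of the input.
  const sentences = text.match(/[^.!?]*[.!?]+\s*|[^.!?]+$/g) ?? [];

  const chunks: string[] = [];
  let current = "";
  for (const raw of sentences) {
    const sentence = raw.trim();
    if (sentence.length === 0) {
      continue;
    }
    if (sentence.length > maxChars) {
      // A single oversized sentence is hard-split so no chunk exceeds the limit.
      if (current.length > 0) {
        chunks.push(current);
        current = "";
      }
      for (let i = 0; i < sentence.length; i += maxChars) {
        chunks.push(sentence.slice(i, i + maxChars));
      }
      continue;
    }
    const candidate = current.length > 0 ? current + " " + sentence : sentence;
    if (candidate.length <= maxChars) {
      current = candidate; // the sentence still fits in the current chunk
    } else {
      chunks.push(current); // close the current chunk and start a new one
      current = sentence;
    }
  }
  if (current.length > 0) {
    chunks.push(current);
  }
  return chunks;
}
```

Each chunk could then be synthesized separately and the resulting audio segments concatenated in order. Breaking at sentence boundaries keeps pauses and intonation from being cut mid-sentence.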

While building Castory, I faced several challenges that pushed me to get creative. One major hurdle was scaling the TTS engine to serve multiple users at once. TTS conversion is resource-intensive, and the platform had to handle many simultaneous requests without degrading performance. To solve this, I containerized the service with Docker and load-balanced it across AWS EC2 instances to distribute the workload and prevent bottlenecks.
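
The post does not say which balancing policy sits in front of the containers, so the following is only a sketch of one common option, a least-loaded dispatch. The TtsWorker shape, the capacity field, and the function names are assumptions made for illustration. In a real AWS deployment this decision would typically be made by a managed load balancer rather than by application code.

```typescript
// Hypothetical model of a TTS container: how many jobs it is running and
// how many it can run at once. None of these names come from the post.
interface TtsWorker {
  id: string;
  activeJobs: number;
  capacity: number;
}

// Pick the worker with the fewest active jobs among those with spare capacity.
// Returns null when every worker is full.
function pickWorker(workers: TtsWorker[]): TtsWorker | null {
  let best: TtsWorker | null = null;
  for (const worker of workers) {
    if (worker.activeJobs >= worker.capacity) {
      continue; // skip saturated workers
    }
    if (best === null || worker.activeJobs < best.activeJobs) {
      best = worker;
    }
  }
  return best;
}

// Assign one job: returns the chosen worker id, or null if the caller
// should queue the job and retry later.
function dispatchJob(workers: TtsWorker[]): string | null {
  const worker = pickWorker(workers);
  if (worker === null) {
    return null;
  }
  worker.activeJobs += 1;
  return worker.id;
}
```

Sending each new job to the least busy container keeps any single instance from becoming the bottleneck, and returning null when all are full gives the caller a clear signal to queue work instead of overloading a worker.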

Another challenge was making the generated voices sound natural, since many TTS engines still struggle to produce human-like speech. To address this, I worked closely with the Qwen3 API, tuning its settings for tone, pitch, and speed to create voices that were both realistic and suited to different genres of books. I also developed a Voice Casting feature that matches the tone of the voice to the content’s genre, for example a calm, soothing voice for self-help books or a more dramatic voice for thrillers.
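
The idea behind Voice Casting can be sketched as a lookup from genre to a voice profile. Everything in this example is hypothetical: the profile fields, the voice names, and the numeric values are placeholders chosen for illustration, and they are not Qwen3 API parameters or Castory's real presets.

```typescript
// Hypothetical voice profile; the field names are illustrative only.
interface VoiceProfile {
  voice: string;
  speed: number; // 1.0 = normal reading speed
  pitch: number; // 0 = unchanged
  tone: string;
}

// Placeholder presets for the two genres mentioned in the text.
const GENRE_PROFILES: Record<string, VoiceProfile> = {
  "self-help": { voice: "calm-narrator", speed: 0.95, pitch: 0, tone: "soothing" },
  thriller: { voice: "dramatic-narrator", speed: 1.05, pitch: -1, tone: "tense" },
};

const DEFAULT_PROFILE: VoiceProfile = {
  voice: "neutral-narrator",
  speed: 1.0,
  pitch: 0,
  tone: "neutral",
};

// Choose a profile for a genre, falling back to a neutral default.
// An optional user-selected speed overrides the preset, clamped to 0.5..2.
function castVoice(genre: string, speedOverride?: number): VoiceProfile {
  const key = genre.trim().toLowerCase();
  const base = GENRE_PROFILES[key] ?? DEFAULT_PROFILE;
  const speed =
    speedOverride === undefined
      ? base.speed
      : Math.min(2, Math.max(0.5, speedOverride));
  return { ...base, speed };
}
```

For example, castVoice("Thriller") selects the dramatic preset, while an unknown genre falls back to the neutral default instead of failing.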

File Format and Parsing Challenges

Supporting various file formats also posed a technical challenge. The platform needed to process .txt, .epub, and .pdf files, which required custom parsers to extract the content while preserving its structure. This was particularly tricky with .pdf files, as they can contain complex layouts and non-standard text encodings. After iterating on the parsers, I was able to ensure that Castory could handle these formats smoothly without losing the integrity of the original content.
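
As a rough illustration of the cleanup that text extracted from a .pdf often needs, the sketch below picks a parser by file extension and then reflows hard-wrapped lines back into paragraphs. It assumes the raw text has already been extracted by some PDF library. The function names are hypothetical, and the de-hyphenation rule is a deliberate simplification, not a description of Castory's parsers.

```typescript
type BookFormat = "txt" | "epub" | "pdf";

// Decide which parser to use from the file extension (case-insensitive).
// Returns null for unsupported or missing extensions.
function detectFormat(filename: string): BookFormat | null {
  const ext = filename.toLowerCase().split(".").pop();
  if (ext === "txt") return "txt";
  if (ext === "epub") return "epub";
  if (ext === "pdf") return "pdf";
  return null;
}

// Reflow text whose lines were hard-wrapped by the page layout:
// 1. normalize line endings,
// 2. rejoin words hyphenated across a line break (ASCII lowercase only;
//    this can also merge genuine compounds such as "well-" + "known"),
// 3. treat blank lines as paragraph breaks,
// 4. collapse the remaining line breaks and repeated spaces into one space.
function reflowExtractedText(raw: string): string {
  const normalized = raw
    .replace(/\r\n?/g, "\n")
    .replace(/([a-z])-\n([a-z])/g, "$1$2");
  return normalized
    .split(/\n\s*\n/)
    .map((paragraph) => paragraph.replace(/\s+/g, " ").trim())
    .filter((paragraph) => paragraph.length > 0)
    .join("\n\n");
}
```

Preserving paragraph breaks matters for audio because they are natural places for a pause, while line breaks caused by page width carry no meaning and should not reach the TTS engine.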

Results and Future of Castory

Now that Castory is up and running, it’s already showing promising results. Users have praised the platform for its ease of use and the high quality of the audiobooks generated. What would normally take weeks of professional recording can now be done in just a few hours, allowing authors and content creators to produce audiobooks quickly and affordably. In terms of language support, the Qwen3 engine has allowed Castory to expand its offerings, giving users the ability to create audiobooks in multiple languages and dialects. As a result, Castory is becoming an invaluable tool for authors, educators, and content creators looking to reach a wider audience through audiobooks.

Looking Forward: Expanding Castory’s Capabilities

Looking ahead, there’s still a lot more I want to do with Castory. The platform is already getting positive feedback, but there’s always room for improvement. I plan to expand the language options even further, adding support for more regional dialects and languages. I also want to develop more advanced features, such as AI-driven voice generation that can adjust based on the user’s mood or content type. Additionally, building a community around Castory—where authors, voice artists, and listeners can interact and share their work—is something I’m very excited about.

Thanks for following along with the journey of building Castory. If you’re interested in learning more or giving the platform a try, feel free to check out https://x.com/castory_x.

I’m eager to continue improving and expanding Castory, and I’d love to hear your feedback.
