I participated again in Google Summer of Code with Kiwix, this time reengineering the Zimfarm project. Zimfarm is a semi-decentralized software solution to build ZIM files efficiently. This means scraping web content, packaging it into a ZIM file and uploading the result to an online repository of ZIM files.
About the Organization
Kiwix is a non-profit organization and a free and open-source software project dedicated to providing offline access to free educational content. It compresses copies of entire websites into single ZIM files small enough to fit on a user’s device and provides applications that read these local copies, enabling people with no or limited internet access to enjoy the same browsing experience as anyone else.
Project Details
At a high level, the project comprises:
- Backend API - a REST API that manages recipes and distributes tasks.
- Frontend UI - a web UI to create/edit recipes and monitor tasks/workers.
- Workers - machines that run scrapers in containers to build ZIM files, then upload the results.
- Uploader & Receiver - handle file transfers and validate ZIM before publishing.
- Scrapers - containerized tools (e.g. for MediaWiki, Stack Exchange, Gutenberg) that convert websites into ZIM format.
Work Done
As of September 1st, 2025, the reengineering of the Zimfarm project spanned over 100 pull requests, with Python and TypeScript as the primary programming languages. The code for the project lives on openzim/zimfarm.
Starting from the backend, I began with the introduction of Hatch as a dependency manager to pin all dependencies to a specific version. I proceeded to upgrade some of the existing libraries to major versions while others were replaced entirely with more feature-rich ones. The most notable replacements in this ambitious reengineering were:
- Flask → FastAPI
- Marshmallow → Pydantic
- Paramiko and subprocess.run calls → Cryptography library
- JavaScript → TypeScript
- Vue 2 → Vue 3
As a consequence of these upgrades and replacements, a few breaking changes were introduced (they were unavoidable). This meant bumping the backend API to v2, not only to take full advantage of the features of the new libraries but also to clean up parts of the old API that were fragile and inelegant, such as:
- relying on subprocess.run calls to perform actions such as verification of authentication messages
- query parameters that contained special characters crashing the server (#1131)
- crashes in user creation when fields that should be required (like an email address) were missing (#1058)
- UI inconsistencies that caused buttons to remain active even when no changes were pending in the recipe editor (#994)
- ZIM metadata values not being escaped properly (#1203)
Aside from the library switches and upgrades, the reengineering introduced significant features including but not limited to:
- improved security by properly escaping flag inputs when constructing offliner commands (#1216)
- improved task assignment by introducing context-based filtering, ensuring tasks run only on compatible workers (#1233)
- added support for SSH keys generated using the ECDSA algorithm (#1190, #1195)
- added support for SSH keys generated using the Ed25519 signature scheme (#1279)
- standardized schedule language codes to the ISO 639-3 format (#1241)
- switched from a bare-bones requirements.txt to Hatch for dependency management (#1106)
- used modern type annotations and tooling like Pyright and Ruff to enforce type-checking and code quality
- introduced functions to introspect Pydantic schemas and Python type definitions, enabling the extraction of type information for validation and client-side reuse (#1150, #1246)
- enforced ZIM metadata conventions (#1224)
- improved the UI by making it more responsive, appealing and usable on small/mobile screens
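To make the flag-escaping item above more concrete, here is a minimal sketch of the general idea using Python's standard shlex module. It is not the actual Zimfarm code: the function name build_offliner_command, the executable name and the flag names are all hypothetical and only serve the illustration.

```python
from __future__ import annotations

import shlex


def build_offliner_command(executable: str, flags: dict[str, str | bool]) -> str:
    """Render a shell-safe command line from a mapping of flag names to values.

    Hypothetical helper for illustration only, not the Zimfarm implementation.
    """
    parts = [shlex.quote(executable)]
    for name, value in flags.items():
        if value is True:
            # Boolean flags that are switched on are emitted without a value.
            parts.append(f"--{name}")
        elif value is False:
            # Boolean flags that are switched off are omitted entirely.
            continue
        else:
            # Quote the whole token so spaces and shell metacharacters stay literal.
            parts.append(shlex.quote(f"--{name}={value}"))
    return " ".join(parts)


command = build_offliner_command(
    "some-offliner",  # hypothetical executable name
    {"url": "https://example.org", "title": "My Wiki; rm -rf /", "verbose": True, "keep": False},
)
# Splitting the command the way a POSIX shell would yields one token per flag,
# so the "; rm -rf /" part stays inside the title value instead of running.
tokens = shlex.split(command)
```

The point of quoting each complete token is that user-supplied values can never terminate the flag early or start a new command, whatever characters they contain.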
To avoid turning the list into a long and boring changelog, I’ve highlighted only a few select improvements (not ranked in any way). If you are curious, you can browse the full list of pull requests on GitHub.
I tried to keep the UI largely the same as the original (partly because I am not good at design 😅) and only made ambitious changes where necessary. Relying heavily on Vuetify, I gave the UI a more modern design and introduced some additional features and pages.
Challenges
Reengineering a project of this size was no small feat, and it was as challenging as it was exciting. The biggest challenges I faced during the project (in no particular order) revolved around:
- converting Marshmallow models to Pydantic models
- extracting metadata from Pydantic models to be able to share validation rules with clients. Doing this cleanly required introspection using the typing module (more on that in a later blog post).
- relaxing validation while reading existing data from the database, but enforcing it on writes. A lot of this meant wrapping validators so that they skip validation based on context.
- using Pydantic’s discriminators to be able to discriminate between the different schemas of the various scrapers/offliners.
- employing PostgreSQL JSON functions to query JSONB columns whose underlying types were changed
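The "relaxed on read, strict on write" item above can be sketched in plain Python. Pydantic lets validators receive a context at validation time; the example below only mimics that idea with a decorator and a dictionary, so the names skip_when_reading, not_empty and the skip_validation key are assumptions made for illustration rather than anything taken from the Zimfarm codebase.

```python
from __future__ import annotations

from typing import Any, Callable

Validator = Callable[[str], str]


def skip_when_reading(validator: Validator) -> Callable[..., str]:
    """Wrap a validator so it is bypassed when the context asks to skip validation."""

    def wrapper(value: str, context: dict[str, Any] | None = None) -> str:
        if context and context.get("skip_validation"):
            # Reading legacy rows: hand the stored value back untouched.
            return value
        # Writing new data: apply the strict rule.
        return validator(value)

    return wrapper


@skip_when_reading
def not_empty(value: str) -> str:
    """Strict rule: reject blank strings."""
    if not value.strip():
        raise ValueError("value must not be empty")
    return value


# A legacy row with an empty value can still be loaded...
legacy_value = not_empty("", {"skip_validation": True})
# ...while the same value would be rejected with a ValueError on a write:
# not_empty("")
```

Keeping the strict rule in one place and toggling it through context avoids maintaining two parallel sets of schemas for reads and writes.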
Again, this list is not exhaustive and does not fully capture the variety of hurdles I faced during the project, but it highlights some of the more interesting ones.
The frontend became a hands-on learning journey that helped me improve my proficiency with Vue 3 and TypeScript. Prior to this, I had only used them sparingly (I cannot remember the last time I did something frontend-related). But after the first couple of weeks, I began to find my feet, thanks to foundational knowledge in JavaScript.
Future Work
We plan to wrap up this GSoC project with the deployment of the zimfarm-upgrade branch on September 8th, 2025. Of course, the journey doesn’t end here. There’s still plenty of work ahead with issues of different priorities ranging from prio1, prio2, and prio3 to a broader backlog waiting to be tackled.
I’m very glad to share that I’ll continue working with Kiwix as a contractor until at least the end of 2025, which is both a testament to their trust in the work I’ve done and to how much I’ve developed since my first pull request to their codebase two years ago.
Things I Learned
Reengineering the project exposed me to more situations where I had to make decisions regarding backwards-compatibility and the trade-offs associated with it.
Navigating the challenges mentioned earlier meant I had to rely heavily on introspection to come up with an elegant solution while still maintaining strict type-checking standards. I also learnt new things about SQL, most importantly the JSON functions and how to wrangle data in the database.
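For readers curious about what introspection with the typing module can look like, here is a small sketch using only the standard library. It uses a dataclass as a stand-in for a Pydantic model, and the names MaxLength, RecipeMetadata and describe_fields are hypothetical; the real project code is likely more involved.

```python
from dataclasses import dataclass
from typing import Annotated, Any, get_args, get_origin, get_type_hints


@dataclass(frozen=True)
class MaxLength:
    """Hypothetical constraint marker attached to a field's annotation."""

    limit: int


@dataclass
class RecipeMetadata:
    """Hypothetical schema standing in for a real model."""

    title: Annotated[str, MaxLength(30)]
    description: Annotated[str, MaxLength(80)]
    tags: list[str]


def describe_fields(schema: type) -> dict[str, dict[str, Any]]:
    """Return each field's base type name and any MaxLength constraint."""
    described: dict[str, dict[str, Any]] = {}
    # include_extras=True keeps the Annotated metadata instead of stripping it.
    for name, hint in get_type_hints(schema, include_extras=True).items():
        constraints: dict[str, Any] = {}
        if get_origin(hint) is Annotated:
            base, *extras = get_args(hint)
            for extra in extras:
                if isinstance(extra, MaxLength):
                    constraints["max_length"] = extra.limit
        else:
            base = hint
        # For generics such as list[str], report the container type.
        origin = get_origin(base) or base
        described[name] = {"type": origin.__name__, **constraints}
    return described


rules = describe_fields(RecipeMetadata)
```

A dictionary like rules can be serialized to JSON and shipped to a client, so that the frontend applies the same limits as the backend without duplicating them by hand.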
If last year’s participation in the Google Summer of Code changed the way I wrote code, this year’s edition deepened my engineering discipline and how I think about building systems.
Acknowledgements
I express gratitude to Google for providing me with this opportunity to contribute to Open Source Software for the second time in a row.
Thanks to the team at Kiwix for reviewing my pull requests during the submission phase with the same responsiveness as always, accepting my proposal and making this a reality.
Whether you are a newbie or a seasoned developer looking to get started with open-source and collaborative development, I implore you to start with the Kiwix codebase. The team is incredibly responsive, offers constructive feedback, and makes the contribution process both welcoming and rewarding. You’ll not only sharpen your technical skills but also get the chance to work on projects that make a real impact.
My biggest thanks go to my project mentor Benoît Béraud for his feedback and help with challenges during the project. Without his feedback, none of this would have been possible, as he almost always had a suggestion when I hit a wall. His careful organization of the issues and detailed explanations meant there was little to and fro on the issues, which accelerated the rate of development.
Working with him significantly improved my approach to problem-solving.