In January I removed tens of thousands of web pages from
www.debian.org. Did you notice?
In the past
From 1997 onwards, we had web pages for security announcements. We
had to manually prepare a .data and a .wml file, which then generated a
web page for each security announcement (DSA or DLA). We listed
the 6 most recent announcements in a short list that was generated from
these files. Most of the work that went into the Debian web pages was
creating these files.
Our search engine often listed the pages with security announcements instead
of a more relevant web page for a particular topic.
Preparation
At DebConf Kosovo (2022) I started with a proof of concept and wrote a
script that generates this list without using the .data/.wml files in
the Git repository, reading the primary sources of security
information [1] instead. This new list now includes links to the
security tracker and to the email of the announcement.
The following web pages and scripts also used these .data and .wml
files:
- OVAL files
- RSS feeds for security announcements (and LTS)
- Apache config file for mapping URLs from dsa-NNN to YEAR/dsa-NNN
- A huge list of cross-references between DSA and CVE numbers
Before I could remove all the security web pages, I had to adjust the
scripts that create the above information.
When I looked at the OVAL files and the Apache logs of our web server,
I saw that more than 99% of the web traffic was generated by these XML
files (134 TB of 135 TB total in two weeks). They were not compressed
and were around 50 MB each.
With the help of Carsten Schönert we managed to modify the Python
scripts that generate these OVAL files so that they no longer use the
.data/.wml files, and now we only provide bzip2-compressed XML files
[2].
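To illustrate why compression matters for files like these, here is a
minimal sketch using Python's standard bz2 module. The XML content is a
made-up stand-in, not real OVAL data, and the function name is
hypothetical; the actual generator scripts may work differently.

```python
import bz2


def compress_xml(xml_text: str) -> bytes:
    """Return the bzip2-compressed UTF-8 encoding of an XML document."""
    return bz2.compress(xml_text.encode("utf-8"))


# Hypothetical stand-in for a generated OVAL document (the real files
# were around 50 MB); repetitive XML markup compresses very well.
sample = (
    "<oval_definitions>"
    + '<definition class="patch"/>' * 10000
    + "</oval_definitions>"
)

compressed = compress_xml(sample)
original_size = len(sample.encode("utf-8"))
ratio = len(compressed) / original_size

print(f"original: {original_size} bytes, compressed: {len(compressed)} bytes")
print(f"compressed size is {ratio:.2%} of the original")
```

Real OVAL files are less repetitive than this sample, so the ratio
will differ, but XML generally shrinks considerably with bzip2.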
The RSS feeds are created by a new Perl script which reads the
DSA/DLA list of the security tracker and determines the URL of the
email for each entry. This script also generates the list of the most
recent DSA/DLA entries. Currently we show the last 350 entries, which
cover more than the last year and include links to the announcement
email and the security tracker.
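The real script is written in Perl; the following Python sketch only
illustrates the idea of picking the most recent entries from such a
list. The header line format, the tracker URL pattern and all names in
the code are assumptions, and the sample entries are invented. Looking
up the URL of the announcement email is left out, since the text does
not describe how that is done.

```python
import re
from dataclasses import dataclass
from datetime import datetime

# Assumed format of an entry header line in the DSA/DLA list.
HEADER = re.compile(r"^\[(\d{2} \w{3} \d{4})\] ((?:DSA|DLA)-\d+(?:-\d+)?) (\S+)")

# Assumed URL pattern of the security tracker (illustration only).
TRACKER_URL = "https://security-tracker.debian.org/tracker/"


@dataclass(frozen=True)
class Advisory:
    date: datetime
    name: str
    package: str

    @property
    def tracker_link(self) -> str:
        return TRACKER_URL + self.name


def recent_advisories(text: str, limit: int) -> list[Advisory]:
    """Parse header lines and return the newest `limit` advisories."""
    entries = []
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            date = datetime.strptime(match.group(1), "%d %b %Y")
            entries.append(Advisory(date, match.group(2), match.group(3)))
    entries.sort(key=lambda adv: adv.date, reverse=True)
    return entries[:limit]


# Invented sample data, deliberately not in date order.
SAMPLE = """\
[05 Jan 2024] DSA-0002-1 example-b - security update
\t{CVE-0000-0002}
[07 Jan 2024] DSA-0003-1 example-c - security update
\t{CVE-0000-0003}
[20 Dec 2023] DSA-0001-1 example-a - security update
\t{CVE-0000-0001}
"""

recent = recent_advisories(SAMPLE, limit=2)
for adv in recent:
    print(adv.date.date(), adv.name, adv.package, adv.tracker_link)
```

With limit=350 the same function would produce a list like the one
described above.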
The huge list of cross-references is no longer needed, since the
mapping of CVE to DSA is already included in the DSA list [3] of the
security tracker.
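As a rough sketch of how such a cross-reference can be derived from the
list on demand, the following Python code builds a CVE-to-advisory
mapping. It assumes that each entry starts with a header line and that
the CVE numbers follow on an indented line in curly braces; the sample
data and all names in the code are invented for illustration.

```python
import re
from collections import defaultdict

# Assumed line formats of the DSA list.
HEADER = re.compile(r"^\[[^\]]+\] ((?:DSA|DLA)-\d+(?:-\d+)?) ")
CVE_LINE = re.compile(r"^\s+\{([^}]*)\}")


def cve_to_advisories(text: str) -> dict[str, list[str]]:
    """Map each CVE number to the advisories that mention it."""
    mapping = defaultdict(list)
    current = None  # advisory whose entry we are currently inside
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.group(1)
            continue
        cves = CVE_LINE.match(line)
        if cves and current:
            for cve in cves.group(1).split():
                mapping[cve].append(current)
    return dict(mapping)


# Invented sample data; one CVE is referenced by two advisories.
SAMPLE = """\
[07 Jan 2024] DSA-0002-1 example-b - security update
\t{CVE-0000-0002 CVE-0000-0003}
\t[bookworm] - example-b 1.0-1+deb12u1
[05 Jan 2024] DSA-0001-1 example-a - security update
\t{CVE-0000-0001 CVE-0000-0002}
"""

xref = cve_to_advisories(SAMPLE)
for cve, advisories in sorted(xref.items()):
    print(cve, "->", ", ".join(advisories))
```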
The number of translated DSAs/DLAs varied widely between languages.
French translations were almost complete, but all other languages
translated the announcements for only a couple of months or years.
For example, in 2022 Italian had 2 translations, Russian 15, Danish
212, French and English each 279. But from 2023 on, only French
translations were made. By generating the list of DSA/DLA we lost the
ability to translate these web pages, but since these announcements
consist of simple, recurring sentences, it is easy to use an automatic
translation service if needed.
Now the translation statistics of all web pages are more
accurate. Instead of 12200 pages that need to be translated
(including all these old DSA/DLA) there are now only 2500 pages to
translate [4]. Languages that had a lot of old DSA/DLA translations
lost some percentage points, but languages that translate newer web
pages gained in the statistics of how many pages are translated.
Examples:
Before
Language       Translated pages   Share
German (de)    3501               28.5%
Italian (it)   1005                8.2%
Danish (da)    6336               51.7%
After
Language       Translated pages   Share
German (de)    1486               59.0%
Italian (it)    909               36.1%
Danish (da)     982               39.0%
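The figures above can be cross-checked with a little arithmetic:
dividing the number of translated pages by the percentage gives the
total number of pages each row implies. A small Python sketch, using
only the numbers from the examples (the percentages are rounded, so
the implied totals are approximate):

```python
def implied_total(translated: int, percent: float) -> float:
    """Total page count implied by a translated count and its share."""
    return translated / (percent / 100)


# (translated pages, percentage) as given in the examples above.
BEFORE = {"de": (3501, 28.5), "it": (1005, 8.2), "da": (6336, 51.7)}
AFTER = {"de": (1486, 59.0), "it": (909, 36.1), "da": (982, 39.0)}

before_totals = {lang: round(implied_total(*row)) for lang, row in BEFORE.items()}
after_totals = {lang: round(implied_total(*row)) for lang, row in AFTER.items()}

print("implied totals before:", before_totals)
print("implied totals after: ", after_totals)
```

The implied totals land close to the roughly 12200 and 2500 pages
mentioned in the text, which suggests those two numbers were rounded.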
Cleanup of all the security web pages
Finally, in January, I could remove all web pages of the security
announcements in one git commit [5]. Using several git rm -rf
commands, this commit removed 54335 files, including around 9650
DSA/DLA data files, 44189 wml files, and nearly 500 Makefiles.
Outcome
No more manual work is needed for the security team and we now have
direct links from a DSA-NNN/DLA-NNN to the email in our mailing list
archive. This was not possible before.
The search results became more accurate.
But we still host a lot of other old content on the Debian web pages
which may be removed in the future.
[1] https://www.debian.org/security/#infos
[2] https://www.debian.org/security/oval/
[3] https://salsa.debian.org/security-tracker-team/security-tracker/-/raw/master/data/DSA/list
[4] https://www.debian.org/devel/website/stats
[5] https://salsa.debian.org/webmaster-team/webwml/-/commit/2aa73ff15bfc4eb2afd85c