This is my third update on writing a language server for Debian packaging files, which
aims at providing a better developer experience for Debian packagers.
Let's go over what I have done since the last report.
Semantic token support
I have added support for what the Language Server Protocol (LSP) calls semantic tokens. These
are used to provide the editor insights into tokens of interest for users. Allegedly,
this is what editors would use for syntax highlighting as well.
Unfortunately, eglot (emacs) does not support semantic tokens, so I was not able to test
this. There is a 3-year-old PR for supporting them, with the last update (about 3 months ago)
basically saying "Please sign the Copyright Assignment". I pinged the GitHub issue in the hopes
it will get unstuck.
For good measure, I also checked if I could try it via neovim. Before installing, I read
the neovim docs, which helpfully listed the features supported. Sadly, I did not spot
semantic tokens among those and parked it there.
That was a bit of a bummer, but I left the feature in for now. If you have an LSP capable
editor that supports semantic tokens, let me know how it works for you! :)
Spellchecking
Finally, I implemented something Otto was missing! :)
This started with Paul Wise reminding me that there were Python bindings for the hunspell
spellchecker. This enabled me to get started with a quick prototype that spellchecked the
Description fields in debian/control. I also added spellchecking of comments while
I was at it.
The spellchecker runs with the standard en_US dictionary from hunspell-en-us, which
does not have a lot of technical terms in it, much less any of the Debian specific slang.
I spent considerable time providing a "built-in" wordlist for technical and Debian specific
slang to overcome this. I also made a "wordlist" for known Debian people that the
spellchecker did not recognise. Said wordlist is fairly short as a proof of concept, and
I fully expect it to be community maintained if the language server becomes a success.
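As a rough illustration of how a built-in wordlist can sit on top of the base dictionary,
here is a minimal sketch. The base_check callable stands in for the real hunspell lookup,
and the names make_checker and DEBIAN_WORDS as well as the tiny word sets are made up for
this example; it is not the actual code from the language server.

```python
from typing import Callable, Iterable

# Hypothetical sample entries; a real wordlist would be much longer.
DEBIAN_WORDS = {"debhelper", "substvars", "dpkg", "lintian"}


def make_checker(
    base_check: Callable[[str], bool],
    extra_words: Iterable[str],
) -> Callable[[str], bool]:
    """Return a checker that accepts a word if it is in the extra
    wordlist or if the base dictionary accepts it."""
    extras = {w.lower() for w in extra_words}

    def check(word: str) -> bool:
        # Consult the cheap in-memory wordlist first, then the dictionary.
        return word.lower() in extras or base_check(word)

    return check


# Stand-in for the en_US dictionary lookup, purely for demonstration.
english = {"the", "package", "uses"}
checker = make_checker(lambda w: w.lower() in english, DEBIAN_WORDS)
```

With this, checker("debhelper") is accepted via the wordlist even though the stand-in
dictionary does not know it, while a real typo is still rejected by both.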
My second problem was performance. I had suspected that spellchecking was not the
fastest thing in the world. Therefore, I added a very small language server for the
debian/changelog, which only supports spellchecking the textual part. Even for a
small changelog of 1000 lines, the spellchecking takes about 5 seconds, which
confirmed my suspicion. With every change you do, the existing diagnostics hang around
for 5 seconds before being updated. Notably, in emacs, it seems that diagnostics
get translated into an absolute character offset, so all diagnostics after the change
get misplaced for every character you type.
Now, there is little I could do to speed up hunspell. But I can, as always, cheat.
The way diagnostics work in the LSP is that the server listens to a set of notifications
like "document opened" or "document changed". In response to that, the server can start
its diagnostics scanning of the document and eventually publish all the diagnostics to
the editor. The spec is quite clear that the server owns the diagnostics and the
diagnostics are sent as a "notification" (that is, fire-and-forget). Accordingly, there
is nothing that prevents the server from publishing diagnostics multiple times for a
single trigger. The only requirement is that the server publishes the accumulated
diagnostics in every publish (that is, no delta updating).
Leveraging this, I had the language server for debian/changelog scan the document and
publish once for approximately every 25 typos (diagnostics) spotted. This means you quickly
get your first result and that clears the obsolete diagnostics. Thereafter, you get
frequent updates to the remainder of the document if you do not perform any further changes.
That is, up to a predefined max of typos, so we do not overload the client for longer
changelogs. If you do any changes, it resets and starts over.
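The batching described above can be sketched roughly as follows. This is only an
illustration under a few assumptions: the function name scan_in_batches, the whitespace
based word splitting, and the default cap of 100 typos are invented for the example, and
each yielded list is what the server would hand to its "publish diagnostics" notification.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Typo = Tuple[int, str]  # (line number, misspelled word)


def scan_in_batches(
    lines: Iterable[str],
    is_correct: Callable[[str], bool],
    batch_size: int = 25,  # publish roughly every 25 typos
    max_typos: int = 100,  # assumed cap; the real limit may differ
) -> Iterator[List[Typo]]:
    """Yield the accumulated typo list roughly every batch_size typos."""
    found: List[Typo] = []
    published = 0
    for line_no, line in enumerate(lines):
        for word in line.split():
            if not is_correct(word):
                found.append((line_no, word))
                if len(found) >= max_typos:
                    break
        # Always publish the full list so far, never a delta.
        if len(found) - published >= batch_size:
            published = len(found)
            yield list(found)
        if len(found) >= max_typos:
            break
    # Final publish; an empty list still clears obsolete diagnostics.
    if published != len(found) or not found:
        yield list(found)
```

Every yielded list contains everything found so far, which matches the requirement that
each publish carries the accumulated diagnostics.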
The only bit missing was dealing with concurrency. By default, a pygls language server
is single-threaded. It is not great if the language server hangs for 5 seconds every time
you type anything. Fortunately, pygls has built-in support for asyncio and threaded
handlers. For now, I did an async handler that awaits after each line and set up some
manual detection to stop an obsolete diagnostics run. This means the server will fairly
quickly abandon an obsolete run.
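To illustrate the idea, here is a minimal asyncio sketch of a run that yields to the event
loop after each line and gives up once a newer run has started for the same document. The
class SpellcheckRuns, the function check_document, and the use of asyncio.sleep(0) as the
await point are assumptions for this example; registering the handler with pygls is left out.

```python
import asyncio
from typing import Callable, Dict, List, Optional


class SpellcheckRuns:
    """Tracks the most recent run per document (hypothetical helper)."""

    def __init__(self) -> None:
        self._latest: Dict[str, int] = {}

    def start(self, uri: str) -> int:
        run_id = self._latest.get(uri, 0) + 1
        self._latest[uri] = run_id
        return run_id

    def is_current(self, uri: str, run_id: int) -> bool:
        return self._latest.get(uri) == run_id


async def check_document(
    runs: SpellcheckRuns,
    uri: str,
    lines: List[str],
    is_correct: Callable[[str], bool],
) -> Optional[List[str]]:
    """Return the typos found, or None if the run became obsolete."""
    run_id = runs.start(uri)
    typos: List[str] = []
    for line in lines:
        typos.extend(w for w in line.split() if not is_correct(w))
        # Give other handlers (such as a new "document changed") a turn.
        await asyncio.sleep(0)
        if not runs.is_current(uri, run_id):
            return None  # a newer run took over; abandon this one
    return typos
```

If two runs for the same document overlap, the older one notices after its next line that
it is no longer current and returns without publishing anything.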
Also, as a side-effect of working on the spellchecking, I fixed multiple typos in the
changelog of debputy. :)
Follow up on the "What next?" from my previous update
In my previous update, I mentioned I had to finish up my python-debian changes to
support getting the location of a token in a deb822 file. That is done: the MR
is now filed and is pending review. Hopefully, it will be merged and uploaded soon. :)
I also submitted my proposal for a different way of handling relationship substvars to
debian-devel. So far, it seems to have received only positive feedback. I hope it stays
that way and we will have this feature soon. Guillem proposed to move some of this into
dpkg, which might delay my plans a bit. However, it might be for the better in the
long run, so I will wait a bit to see what happens on that front. :)
As noted above, I managed to add debian/changelog as a supported format for the
language server. Even if it only does spellchecking and trimming of trailing newlines
on save, it technically is a new format and therefore I can cross that item off my list. :D
Unfortunately, I did not manage to write a linter variant that does not involve using
an LSP-capable editor. So that is still pending. Instead, I submitted an MR against
elpa-dpkg-dev-el to have it recognize all the fields that the debian/control
LSP knows about at this time to offset the lack of semantic token support in
eglot.
From here...
My sprinting on this topic will soon come to an end, so I have to be a bit more careful
now with what tasks I open!
I think I will narrow my focus to providing a batch linting interface. Ideally, with
an auto-fix for some of the more mechanical issues, where there is little doubt about
the answer.
Additionally, I think the spellchecking will need a bit more maturing. My current
code still trips on naming patterns that are "clearly" verbatim or code references
like things written in CamelCase or SCREAMING_SNAKE_CASE. That gets annoying
really quickly. It also trips on a lot of commands like dpkg-gencontrol, but that
is harder to fix since it could have been a real word. I think those will have to be
fixed by people using quotes around the commands. Maybe the most popular ones will end
up in the wordlist.
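For the CamelCase and SCREAMING_SNAKE_CASE part, one possible approach is to filter such
tokens out before they reach the spellchecker. The sketch below is only an idea, not what
the code currently does; the helper name looks_like_code and the exact patterns are
assumptions, and variants such as lowerCamelCase are deliberately not covered.

```python
import re

# Two or more capitalised "humps", e.g. CamelCase or HttpRequestHandler.
_CAMEL_CASE = re.compile(r"^(?:[A-Z][a-z0-9]+){2,}$")
# Upper-case parts joined by at least one underscore, e.g. DEB_BUILD_OPTIONS.
_SCREAMING_SNAKE = re.compile(r"^[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)+$")


def looks_like_code(word: str) -> bool:
    """Return True for tokens that are most likely code references."""
    return bool(_CAMEL_CASE.match(word) or _SCREAMING_SNAKE.match(word))
```

Ordinary capitalised words such as "Debian" have only one hump and still get spellchecked,
and commands like dpkg-gencontrol are not matched either, so those would still need quotes
or a wordlist entry.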
Beyond that, I will play it by ear if I have any time left. :)