Description
I have a FOSS project whose web site is generated by `asciidoc` and some custom scripts as a horde (thousands) of static files locally in the source files' repo, copied into another workspace and uploaded to a github.io-style repository, and eventually served by an HTTP server for browsers around the world to see.

Users occasionally report that some of the links between site pages end up broken (lead nowhere).
The website build platform is generally POSIX-ish, although most often the agent doing the regular work is a Debian/Linux one. Maybe the platform differences cause the "page outages"; maybe this bug is platform-independent.
I had a thought about crafting a check for the two local directories as well as the resulting site: crawl all relative links (and/or absolute ones starting with its domain name(s)) and report any broken pages, so I could focus on finding out why they fail and/or avoid publishing "bad" iterations - same as with compilers, debuggers and warnings elsewhere.
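To make the idea concrete, here is a rough POSIX-shell sketch of the local check, to be run against each of the two directories. It is only a sketch: it assumes double-quoted href attributes, a grep that supports -o, and a hypothetical default path of output/html; srcset, <base href> and URL-encoded names are not handled.

```sh
#!/bin/sh
# Rough sketch: walk a generated tree and verify that every relative
# (or root-relative) href points at a file that actually exists.
SITEDIR="${1:-output/html}"    # hypothetical path to the generated site

find "$SITEDIR" -name '*.html' -o -name '*.htm' | while read -r page; do
    dir=$(dirname "$page")
    grep -o 'href="[^"]*"' "$page" | sed 's/^href="//; s/"$//' |
    while read -r link; do
        # skip external schemes, protocol-relative links and mailto:
        case "$link" in
            http://*|https://*|//*|mailto:*) continue ;;
        esac
        target=${link%%#*}              # drop any #fragment
        [ -n "$target" ] || continue    # a bare "#anchor" needs no file
        case "$target" in
            /*) path="$SITEDIR$target" ;;   # root-relative link
            *)  path="$dir/$target" ;;      # relative to the page
        esac
        # a trailing slash (or a directory target) implies index.html
        [ -d "$path" ] && path="${path%/}/index.html"
        [ -e "$path" ] || printf 'BROKEN: %s -> %s\n' "$page" "$link"
    done
done
```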
The general train of thought is about using some `wget` spider mode, though any other command-line tool (`curl`, `lynx`...), Python script, shell with `sed`, etc. would do as well. Surely this particular wheel has been invented too many times for me to even think about making my own? A quick and cursory googling session while on a commute did not come up with any good fit, however.

So, suggestions are welcome :)
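For the deployed site itself, this is roughly the `wget --spider` invocation I am picturing (the URL is a placeholder, and the exact wording of wget's broken-link summary may differ between versions):

```sh
# Spider the deployed site recursively without downloading page bodies,
# keep a log, and pull out broken-link / 404 reports afterwards.
wget --spider --recursive --level=inf --no-verbose --no-parent \
     --output-file=spider.log https://example.github.io/
grep -E -B1 'broken link|404' spider.log
```

That would only exercise the final published site, though, not the two intermediate local directories, which is why a filesystem-level check like the sketch above would still be useful.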
Posted as a question at https://unix.stackexchange.com/questions/775994/how-to-check-consistency-of-a-generated-web-site-using-recursive-html-parsing