# COSS_ARCHIVING

A utility to
- fetch article requests from Slack
- generate PDFs for them
- compress them
- send them via Slack + email
- upload them to the COSS NAS

... fully automatically. Run it now, thank me later.
## Running - through makefile

Execute the utility by running `make`. This won't do anything on its own; for the main usage you need to specify a mode and a target:

```
make <mode> target=<target>
```
## Overview of the modes

The `production` mode performs all automatic actions and therefore does not require any manual intervention. It queries the Slack workspace, adds the new requests to the database, downloads all files and metadata, uploads the URLs to archive.org and sends out the downloaded article. As a last step the newly created file is synced to the COSS NAS.

The `debug` mode is more flexible and allows for big code changes without the need to rebuild the container. It mounts the code directory directly into the container. As a failsafe the environment variable `DEBUG=true` is set. The whole utility is then run in a sandbox environment (Slack channel, database, email) so that Dirk is not affected by any mishaps.
Two additional 'modes' are `build` and `down`. `build` rebuilds the containers, which is necessary after code changes. `down` ensures a clean shutdown of all containers. Usually the launch script handles this already, but it sometimes fails, in which case `down` needs to be run manually.
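As a sketch of how these modes chain in a typical unattended session, here is a hypothetical wrapper (`run_cycle` is not part of the repository) that runs one production cycle and always calls `down` afterwards:

```shell
#!/bin/sh
# Hypothetical wrapper around the Makefile targets described above:
# run one full production cycle and always shut the containers down,
# even if the run itself fails. Not part of the repository.
run_cycle() {
    target="${1:-news_fetch}"
    ${MAKE:-make} production target="$target"
    status=$?
    ${MAKE:-make} down    # clean shutdown, as `make down` would do
    return $status
}

# Dry-run without docker installed: MAKE=echo run_cycle news_fetch
```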
## Overview of the targets

In essence a target is simply a service from docker compose, run in an interactive environment. As such, all services defined in `docker-compose.yaml` can be called as a target. Only two of them are of real use:

`news_fetch` does the majority of the actions mentioned above. By default, that is without any options, it runs a metadata fetch, a download, and an upload to archive.org. The upload is usually the slowest step, which is why articles that are processed but don't yet have an archive.org URL tend to pile up. You can therefore specify the option `upload`, which only starts the upload for the concerned articles, as a catch-up if you will.
Example usage:

```
make production target=news_fetch                 # full mode
make production target=news_fetch flags=upload    # upload mode (lighter resource usage)
make debug target=news_fetch                      # debug mode, which drops you inside a new shell
make production target=news_check
```
`news_check` starts a webapp, accessible under http://localhost:8080, that allows you to easily check the downloaded articles.
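If you want to verify from a script that the webapp actually responds, a minimal smoke test could look like this (a sketch; `check_webapp` is a made-up helper and assumes `curl` is installed on the host):

```shell
#!/bin/sh
# Hypothetical smoke test for the news_check webapp: prints whether the
# given URL (default: the port mentioned above) responds at all.
check_webapp() {
    url="${1:-http://localhost:8080}"
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
        echo "up: $url"
    else
        echo "down: $url"
    fi
}
```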
## Synchronising changes with the NAS

I recommend `rsync`. From within the ETH network you can launch

```
make nas_sync folder=<target>
```

This launches a docker container running `rsync`, connected to both the COSS NAS share and your local files. Specifying a folder restricts the files that are watched for changes. For example, `make nas_sync folder=2022/September` will take significantly less time than `make nas_sync folder=2022`, but only considers files written to the September folder.

Please check the logs for any suspicious messages: `rsync`ing to SMB shares is prone to errors.
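To illustrate why the `folder` argument narrows the work so much, here is a sketch of how such a scoped sync command could be assembled. The mount points and rsync flags are illustrative assumptions, not the container's actual command:

```shell
#!/bin/sh
# Hypothetical helper: builds (but does not run) the kind of rsync command
# the sync container might issue. Both mount points are made up.
sync_cmd() {
    folder="$1"    # e.g. 2022/September, as in `make nas_sync folder=...`
    echo "rsync -rtv --partial /data/coss_archiving/${folder}/ /mnt/coss_nas/${folder}/"
}
```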
Misc. usage:

```
make build           # rebuilds all containers to reflect code changes
make down            # shuts down all containers (usually not necessary since this occurs automatically)
make edit_profile    # opens a firefox window under localhost:7900 to edit the profile used by news_fetch
make db_interface    # opens a postgres interface to view the remote database (localhost:8080)
```
## First run

The program relies on a functioning firefox profile! For the very first run, execute

```
make edit_profile
```

This will generate a new firefox profile under `coss_archiving/dependencies/news_fetch.profile`.

You can then go to http://localhost:7900 in your browser and check the profile (under `about:profiles`).

Now install two addons: "I don't care about cookies" and "Bypass Paywalls Clean" (from `about:addons`). They ensure that most sites just work out of the box. You can additionally install adblockers such as uBlock Origin.

You can then use this profile to further tweak various sites. The state of the sites (namely their cookies) will be used by `news_fetch`.

Whenever you need to make changes to the profile, for instance to log in to websites again, just rerun `make edit_profile`.
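Since a missing or broken profile is the most common first-run problem, a small sanity check before a production run can save a failed cycle. This helper is a sketch, not part of the repo; the default path comes from the text above:

```shell
#!/bin/sh
# Hypothetical sanity check: make sure the firefox profile generated by
# `make edit_profile` exists before kicking off a production run.
profile_ok() {
    dir="${1:-dependencies/news_fetch.profile}"
    if [ -d "$dir" ]; then
        echo "profile found: $dir"
    else
        echo "missing profile - run: make edit_profile"
    fi
}
```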
## Building

The software will change. Because the images referenced in docker compose are usually the `latest` ones, it is sufficient to update the containers. Run

```
docker compose --env-file env/production build
```

or, more simply, just run

```
make build
```

(should issues occur you can also run `make build flags=--no-cache`)
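Since the compose file pins `latest` images, an update is usually a pull followed by a build. A hedged sketch of that routine (`update_images` is a made-up helper; `DOCKER=echo` lets you dry-run it without docker installed):

```shell
#!/bin/sh
# Hypothetical update routine: pull fresh base images, then rebuild the
# containers, using the production env file from the repository.
update_images() {
    env_file="${1:-env/production}"
    ${DOCKER:-docker} compose --env-file "$env_file" pull
    ${DOCKER:-docker} compose --env-file "$env_file" build
}
```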
## Roadmap

- handle paywalled sites like faz, spiegel, ... through their dedicated sites (see nexisuni.com for instance), available through the ETH network
- improve the reliability of nas_sync (+ logging)
- divide the month folders into smaller ones
## Appendix: Running - through docker compose

I strongly recommend sticking to the usage of `make`.

Instead of using the launch file you can manually issue `docker compose` commands, for example to check the logs.

All relevant mounts and env variables are easiest specified through the env file, for which I configured two versions:

- production
- debug (development in general)

These files will have to be adapted to your individual setup but won't change significantly once set up.

Example usage:

```
docker compose --env-file env/production run news_fetch           # full mode
docker compose --env-file env/production run news_fetch upload    # upload mode (lighter resource usage)
docker compose --env-file env/debug run news_fetch                # debug mode, which drops you inside a new shell
docker compose --env-file env/production run news_check

# Misc:
docker compose --env-file env/production up                       # starts all services and shows their combined logs
docker compose --env-file env/production logs -f news_fetch       # follows along with the logs of only one service
docker compose --env-file env/production down
```