diff --git a/README.md b/README.md index 8fac278..9a84016 100644 --- a/README.md +++ b/README.md @@ -1,41 +1,65 @@ -# Auto_news +# COSS_ARCHIVING -A utility to fetch article requests from slack and generate pdfs for them, fully automatically. +A utility to +* fetch article requests from slack +* generate pdfs for them +* compress them +* send them via slack + email +* upload them to the COSS NAS +... fully automatically. Run it now, thank me later. + +--- ## Running - Docker compose -A rudimentary docker compose file makes for much simpler command linde calls. +The included `docker-compose` file is now necessary for easy orchestration of the various services. -* For normal, `production` mode and `upload` mode, run: +All relevant passthroughs and mounts are specified through the env-file, for which I configured 4 versions: - `docker compose --env-file env/ up` +* production +* debug (development in general) +* upload +* check + +These files will have to be adapted to your individual setup but won't change significantly once set up. + +### Overview of the modes + +The production mode performs all automatic actions and therefore does not require any manual intervention. It queries the slack workspace, adds the new requests to the database, downloads all files and metadata, uploads the urls to archive.org and sends out the downloaded article. As a last step the newly created file is synced to the COSS-NAS. + +The debug mode is more sophisticated and allows for big code changes without the need to rebuild the image. It directly mounts the code-directory into the container. As a failsafe the environment-variable `DEBUG=true` is set. The whole utility is then run on a sandbox environment (slack-channel, database, email) so that Dirk is not affected by any mishaps. + +The check mode is less sophisticated but shows the downloaded articles to the host for visual verification. This requires passthroughs for X11. + +Upload mode is much simpler: it goes over the existing database and operates on the articles whose upload to archive.org has not yet occurred (archive.org is slow and the other operations usually finish before the queue is consumed). It retries their upload. + +* For normal `production` mode, run: + + `docker compose --env-file env/production up` - All relevant passthroughs and mounts are specified through the env-file, for which I configured 4 versions: production, debug (development in general), upload and check. These files will have to be adapted to your individual setup but can be reused more easily. * For `debug` mode, you will likely want interactivity, so you need to run: - `docker compose --env-file env/debug up -d && docker compose --env-file env/debug exec auto_news bash && docker compose --env-file env/debug down` + `docker compose --env-file env/debug up -d && docker compose --env-file env/debug exec news_fetch bash && docker compose --env-file env/debug down` which should automatically shutdown the containers once you are done. (`ctrl+d` to exit the container shell). If not, re-run `docker compose --env-file env/debug down` manually. > Note: - > The live-mounted code is then under `/code`. Note that the `DEBUG=true` environment variable is still set. If you want to test things on production, run `export DEBUG=false`. Running `python runner.py` will now run the newly written code but, with the production database and storage. + > The live-mounted code is now under `/code`. Note that the `DEBUG=true` environment variable is still set.
If you want to test things on production, run `export DEBUG=false`. Running `python runner.py` will now run the newly written code but with the production database and storage. * For `check` mode, some env-variables are also changed and you still require interactivity. You don't need the geckodriver service however. The simplest way is to run -`docker compose --env-file env/check run auto_news` + `docker compose --env-file env/check run news_fetch` + +* Finally, for `upload` mode no interactivity and no additional services are required. Simply run: + + `docker compose --env-file env/upload run news_fetch` ## Building > The software (firefox, selenium, python) changes frequently. For non-breaking changes it is useful to regularly clean build the docker image! This is also crucial to update the code itself. -In docker, simply run: - -`docker build -t auto_news --no-cache .` - -where the `Dockerfile` has to be in the working directory - In docker compose, run `docker compose --env-file env/production build` @@ -43,10 +67,6 @@ In docker compose, run - - ## Roadmap: -:x: automatically upload files to NAS - -:x: handle paywalled sites like faz, spiegel, .. through their dedicated edu-friendly sites +[_] handle paywalled sites like faz, spiegel, ... through their dedicated sites (see nexisuni.com for instance), available through the ETH network diff --git a/docker-compose.yaml b/docker-compose.yaml index 5679a3a..224064b 100644 --- a/docker-compose.yaml +++ b/docker-compose.yaml @@ -1,12 +1,14 @@ # docker compose --env-file env/debug up version: "3.9" + services: - auto_news: - build: . - image: auto_news:latest + + news_fetch: + build: news_fetch + image: news_fetch:latest volumes: - - ${CONTAINER_DATA}:/app/file_storage + - ${CONTAINER_DATA}:/app/containerdata - ${CODE:-/dev/null}:/code # not set in prod, defaults to /dev/null - ${XSOCK-/dev/null}:${XSOCK-/tmp/sock} - ${XAUTHORITY-/dev/null}:/home/auto_news/.Xauthority @@ -14,6 +16,7 @@ services: - DISPLAY=$DISPLAY - TERM=xterm-256color # colored logs - COLUMNS=160 # for wider logs + - DEBUG=${DEBUG} - CHECK=${CHECK} - UPLOAD=${UPLOAD} @@ -23,8 +26,9 @@ services: stdin_open: ${INTERACTIVE:-false} # docker run -i tty: ${INTERACTIVE:-false} # docker run -t + geckodriver: - image: selenium/standalone-firefox:101.0 + image: selenium/standalone-firefox:102.0.1 volumes: - ${XSOCK-/dev/null}:${XSOCK-/tmp/sock} - ${XAUTHORITY-/dev/null}:/home/auto_news/.Xauthority @@ -34,4 +38,34 @@ services: - START_XVFB=false user: 1001:1001 expose: # exposed to other docker-compose services only - - "4444" \ No newline at end of file + - "4444" + + + vpn: + image: wazum/openconnect-proxy:latest + env_file: + - ${CONTAINER_DATA}/config/vpn.config + cap_add: + - NET_ADMIN + volumes: + - /dev/net/tun:/dev/net/tun + # alternative to cap_add & volumes: specify privileged: true + + + nas_sync: + depends_on: + - vpn # used to establish a connection to the SMB server + network_mode: "service:vpn" + build: nas_sync + image: nas_sync:latest + cap_add: # capabilities needed for mounting the SMB share + - SYS_ADMIN + - DAC_READ_SEARCH + volumes: + - ${CONTAINER_DATA}/files:/sync/local_files + - ${CONTAINER_DATA}/config/nas_sync.config:/sync/nas_sync.config + - ${CONTAINER_DATA}/config/nas_login.config:/sync/nas_login.config + command: + - nas22.ethz.ch/gess_coss_1/helbing_support/Files RM/Archiving/TEST # first argument is the target mount path + - lsyncd + - /sync/nas_sync.config diff --git a/env/check b/env/check index d9be696..d86271c 100644 --- a/env/check +++
b/env/check @@ -1,7 +1,7 @@ # Does not run any downloads but displays the previously downloaded but not yet checked files. Requires display-acces via xauth -CONTAINER_DATA=~/Bulk/COSS/Downloads/auto_news.container -HOSTS_FILE=~/Bulk/COSS/Downloads/auto_news.container/dependencies/hosts +CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving +HOSTS_FILE=~/Bulk/COSS/Downloads/coss_archiving/dependencies/hosts XAUTHORTIY=$XAUTHORTIY XSOCK=/tmp/.X11-unix diff --git a/env/debug b/env/debug index 2c2ab99..ce8714e 100644 --- a/env/debug +++ b/env/debug @@ -1,7 +1,7 @@ # Runs in a debugging mode, does not launch anything at all but starts a bash process -CONTAINER_DATA=~/Bulk/COSS/Downloads/auto_news.container -HOSTS_FILE=~/Bulk/COSS/Downloads/auto_news.container/dependencies/hosts +CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving +HOSTS_FILE=~/Bulk/COSS/Downloads/coss_archiving/dependencies/hosts CODE=./ XAUTHORTIY=$XAUTHORTIY diff --git a/env/production b/env/production index a7d0b7a..0d67a27 100644 --- a/env/production +++ b/env/production @@ -1,7 +1,7 @@ # Runs on the main slack channel with the full worker setup. If nothing funky has occured, reducedfetch is a speedup -CONTAINER_DATA=~/Bulk/COSS/Downloads/auto_news.container -HOSTS_FILE=~/Bulk/COSS/Downloads/auto_news.container/dependencies/hosts +CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving +HOSTS_FILE=~/Bulk/COSS/Downloads/coss_archiving/dependencies/hosts DEBUG=false CHECK=false diff --git a/env/upload b/env/upload index 83da2ca..03055fb 100644 --- a/env/upload +++ b/env/upload @@ -1,7 +1,7 @@ # Does not run any other workers and only upploads to archive the urls that weren't previously uploaded -CONTAINER_DATA=~/Bulk/COSS/Downloads/auto_news.container -HOSTS_FILE=~/Bulk/COSS/Downloads/auto_news.container/dependencies/hosts +CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving +HOSTS_FILE=~/Bulk/COSS/Downloads/coss_archiving/dependencies/hosts DEBUG=false diff --git a/misc/hotfix_missed_messages.py b/misc/hotfix_missed_messages.py index 0ad3364..6cdb69d 100644 --- a/misc/hotfix_missed_messages.py +++ b/misc/hotfix_missed_messages.py @@ -10,7 +10,7 @@ from persistence import message_models # Constant values... 
-MESSAGES_DB = "/app/file_storage/messages.db" +MESSAGES_DB = "/app/containerdata/messages.db" BOT_ID = "U02MR1R8UJH" ARCHIVE_ID = "C02MM7YG1V4" diff --git a/nas_sync/Dockerfile b/nas_sync/Dockerfile new file mode 100644 index 0000000..0d71983 --- /dev/null +++ b/nas_sync/Dockerfile @@ -0,0 +1,9 @@ +FROM bash:latest +# alpine with bash instead of sh + +RUN apk add lsyncd cifs-utils +RUN mkdir -p /sync/remote_files +COPY entrypoint.sh /sync/entrypoint.sh + + +ENTRYPOINT ["bash", "/sync/entrypoint.sh"] diff --git a/nas_sync/entrypoint.sh b/nas_sync/entrypoint.sh new file mode 100644 index 0000000..2b65df4 --- /dev/null +++ b/nas_sync/entrypoint.sh @@ -0,0 +1,10 @@ +#!/bin/bash +set -e + +sleep 5 # waits for the vpn to have an established connection +echo "Starting NAS sync" +mount -t cifs "//$1" -o credentials=/sync/nas_login.config /sync/remote_files +echo "Successfully mounted SAMBA remote: $1 --> /sync/remote_files" +shift # consumes the variable set in $1 so that $@ only contains the remaining arguments + +exec "$@" diff --git a/Dockerfile b/news_fetch/Dockerfile similarity index 95% rename from Dockerfile rename to news_fetch/Dockerfile index 0ba9459..e94b3ba 100644 --- a/Dockerfile +++ b/news_fetch/Dockerfile @@ -1,6 +1,6 @@ FROM python:latest -ENV TZ Euopre/Zurich +ENV TZ Europe/Zurich # RUN echo "deb http://deb.debian.org/debian/ unstable main contrib non-free" >> /etc/apt/sources.list # allows the installation of the latest firefox-release (debian is not usually a rolling release) @@ -35,5 +35,3 @@ RUN python3 -m pip install -r /app/requirements.txt COPY app /app/auto_news WORKDIR /app/auto_news - -ENTRYPOINT ["python3", "runner.py"] diff --git a/app/configuration.py b/news_fetch/app/configuration.py similarity index 67% rename from app/configuration.py rename to news_fetch/app/configuration.py index ff45830..5ee9503 100644 --- a/app/configuration.py +++ b/news_fetch/app/configuration.py @@ -1,7 +1,9 @@ -from ast import parse +from dataclasses import dataclass import os +import shutil import configparser import logging +from datetime import datetime from peewee import SqliteDatabase from rich.logging import RichHandler @@ -17,7 +19,7 @@ logger = logging.getLogger(__name__) # load config file containing constants and secrets parsed = configparser.ConfigParser() -parsed.read("/app/file_storage/config.ini") +parsed.read("/app/containerdata/config/news_fetch.config.ini") if os.getenv("DEBUG", "false") == "true": logger.warning("Found 'DEBUG=true', setting up dummy databases") @@ -28,8 +30,18 @@ if os.getenv("DEBUG", "false") == "true": parsed["DOWNLOADS"]["local_storage_path"] = parsed["DATABASE"]["db_path_dev"] else: logger.warning("Found 'DEBUG=false' and running on production databases, I hope you know what you're doing...") - db_base_path = parsed["DATABASE"]["db_path_prod"] + logger.info("Backing up databases") + backup_dst = parsed["DATABASE"]["db_backup"] + today = datetime.today().strftime("%Y.%m.%d") + shutil.copyfile( + os.path.join(db_base_path, parsed["DATABASE"]["chat_db_name"]), + os.path.join(backup_dst, today + "." + parsed["DATABASE"]["chat_db_name"]), + ) + shutil.copyfile( + os.path.join(db_base_path, parsed["DATABASE"]["download_db_name"]), + os.path.join(backup_dst, today + "."
+ parsed["DATABASE"]["download_db_name"]), + ) from utils_storage import models diff --git a/app/runner.py b/news_fetch/app/runner.py similarity index 95% rename from app/runner.py rename to news_fetch/app/runner.py index 818b9ad..fdb71b9 100644 --- a/app/runner.py +++ b/news_fetch/app/runner.py @@ -14,11 +14,11 @@ from utils_worker.workers import CompressWorker, DownloadWorker, FetchWorker, Up class ArticleWatcher: """Wrapper for a newly created article object. Notifies the coordinator upon change/completition""" def __init__(self, article, thread, **kwargs) -> None: + self.article_id = article.id # in case article becomes None at any point, we can still track the article self.article = article self.thread = thread self.completition_notifier = kwargs.get("notifier") - self.fetch = kwargs.get("worker_fetch", None) self.download = kwargs.get("worker_download", None) self.compress = kwargs.get("worker_compress", None) @@ -95,7 +95,8 @@ class ArticleWatcher: self._upload_completed = value self.update_status("upload") - + def __str__(self) -> str: + return f"Article with id {self.article_id}" class Coordinator(Thread): @@ -154,7 +155,7 @@ class Coordinator(Thread): for article in articles: notifier = lambda article: print(f"Completed manual actions for {article}") - ArticleWatcher(article, workers_manual = workers, notifier = notifier) + ArticleWatcher(article, None, workers_manual = workers, notifier = notifier) # ArticleWatcher wants a thread to link the article to. TODO: handle threads as a kwarg def article_complete_notifier(self, article, thread): if self.worker_slack is None: diff --git a/app/utils_check/runner.py b/news_fetch/app/utils_check/runner.py similarity index 100% rename from app/utils_check/runner.py rename to news_fetch/app/utils_check/runner.py diff --git a/app/utils_mail/runner.py b/news_fetch/app/utils_mail/runner.py similarity index 100% rename from app/utils_mail/runner.py rename to news_fetch/app/utils_mail/runner.py diff --git a/app/utils_slack/message_helpers.py b/news_fetch/app/utils_slack/message_helpers.py similarity index 98% rename from app/utils_slack/message_helpers.py rename to news_fetch/app/utils_slack/message_helpers.py index 14c1d60..2c4b1f0 100644 --- a/app/utils_slack/message_helpers.py +++ b/news_fetch/app/utils_slack/message_helpers.py @@ -158,11 +158,11 @@ def fetch_missed_channel_reactions(): channel = config["archive_id"], timestamp = t.slack_ts ) - reactions = query["message"].get("reactions", []) # default = [] + reactions = query.get("message", {}).get("reactions", []) # default = [] except SlackApiError: # probably a rate_limit: logger.error("Hit rate limit while querying reactions.
retrying in {}s ({}/{} queries elapsed)".format(config["api_wait_time"], i, len(threads))) time.sleep(int(config["api_wait_time"])) - reactions = query["message"].get("reactions", []) + reactions = query.get("message", {}).get("reactions", []) for r in reactions: reaction_dict_to_model(r, t) diff --git a/app/utils_slack/runner.py b/news_fetch/app/utils_slack/runner.py similarity index 100% rename from app/utils_slack/runner.py rename to news_fetch/app/utils_slack/runner.py diff --git a/app/utils_storage/migrations/migration.001.py b/news_fetch/app/utils_storage/migrations/migration.001.py similarity index 100% rename from app/utils_storage/migrations/migration.001.py rename to news_fetch/app/utils_storage/migrations/migration.001.py diff --git a/app/utils_storage/models.py b/news_fetch/app/utils_storage/models.py similarity index 100% rename from app/utils_storage/models.py rename to news_fetch/app/utils_storage/models.py diff --git a/app/utils_worker/_init__.py b/news_fetch/app/utils_worker/_init__.py similarity index 100% rename from app/utils_worker/_init__.py rename to news_fetch/app/utils_worker/_init__.py diff --git a/app/utils_worker/compress/runner.py b/news_fetch/app/utils_worker/compress/runner.py similarity index 100% rename from app/utils_worker/compress/runner.py rename to news_fetch/app/utils_worker/compress/runner.py diff --git a/app/utils_worker/download/__init__.py b/news_fetch/app/utils_worker/download/__init__.py similarity index 100% rename from app/utils_worker/download/__init__.py rename to news_fetch/app/utils_worker/download/__init__.py diff --git a/app/utils_worker/download/browser.py b/news_fetch/app/utils_worker/download/browser.py similarity index 100% rename from app/utils_worker/download/browser.py rename to news_fetch/app/utils_worker/download/browser.py diff --git a/app/utils_worker/download/runner.py b/news_fetch/app/utils_worker/download/runner.py similarity index 100% rename from app/utils_worker/download/runner.py rename to news_fetch/app/utils_worker/download/runner.py diff --git a/app/utils_worker/download/youtube.py b/news_fetch/app/utils_worker/download/youtube.py similarity index 81% rename from app/utils_worker/download/youtube.py rename to news_fetch/app/utils_worker/download/youtube.py index f7d5c0e..77a34ff 100644 --- a/app/utils_worker/download/youtube.py +++ b/news_fetch/app/utils_worker/download/youtube.py @@ -49,17 +49,3 @@ class YouTubeDownloader: article_object.file_name = "" return article_object - - - -# class DummyArticle: -# article_url = "https://www.welt.de/politik/ausland/article238267261/Baerbock-Lieferung-gepanzerter-Fahrzeuge-an-die-Ukraine-kein-Tabu.html" -# save_path = "/app/file_storage/" -# fname_template = "www.youtube.com -- Test" -# file_name = "" - -# m = DummyArticle() -# t = YouTubeDownloader() -# t.save_video(m) - -# print(m.file_name) diff --git a/app/utils_worker/fetch/runner.py b/news_fetch/app/utils_worker/fetch/runner.py similarity index 100% rename from app/utils_worker/fetch/runner.py rename to news_fetch/app/utils_worker/fetch/runner.py diff --git a/app/utils_worker/upload/runner.py b/news_fetch/app/utils_worker/upload/runner.py similarity index 91% rename from app/utils_worker/upload/runner.py rename to news_fetch/app/utils_worker/upload/runner.py index 5714bce..f72d6f3 100644 --- a/app/utils_worker/upload/runner.py +++ b/news_fetch/app/utils_worker/upload/runner.py @@ -12,7 +12,6 @@ def upload_to_archive(article_object): archive_url = wayback.save() # logger.info(f"{url} uploaded to archive successfully")
article_object.archive_url = archive_url - # time.sleep(4) # Archive Uploads rate limited to 15/minute except Exception as e: article_object.archive_url = "Error while uploading: {}".format(e) diff --git a/app/utils_worker/worker_template.py b/news_fetch/app/utils_worker/worker_template.py similarity index 100% rename from app/utils_worker/worker_template.py rename to news_fetch/app/utils_worker/worker_template.py diff --git a/app/utils_worker/workers.py b/news_fetch/app/utils_worker/workers.py similarity index 91% rename from app/utils_worker/workers.py rename to news_fetch/app/utils_worker/workers.py index 8d46707..9526ca3 100644 --- a/app/utils_worker/workers.py +++ b/news_fetch/app/utils_worker/workers.py @@ -48,8 +48,8 @@ class UploadWorker(TemplateWorker): def _handle_article(self, article_watcher): def action(*args, **kwargs): - run_upload(*args, **kwargs) - time.sleep(5) # uploads to archive are throttled to 15/minute + time.sleep(10) # uploads to archive are throttled to 15/minute, but 5s still triggers a blacklisting + return run_upload(*args, **kwargs) super()._handle_article(article_watcher, action) article_watcher.upload_completed = True diff --git a/requirements.txt b/news_fetch/requirements.txt similarity index 100% rename from requirements.txt rename to news_fetch/requirements.txt
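
For reference, a minimal sketch of what a complete env file (for example `env/production`) could look like. The variable names are the ones referenced in `docker-compose.yaml`; the concrete paths and values below are placeholders and have to be adapted to the individual setup:

# env/production (sketch; adapt paths and values to your host)
CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving
HOSTS_FILE=~/Bulk/COSS/Downloads/coss_archiving/dependencies/hosts

DEBUG=false
CHECK=false
UPLOAD=false

# only needed for the interactive/graphical modes (debug, check):
# CODE=./
# XAUTHORITY=$XAUTHORITY
# XSOCK=/tmp/.X11-unix
# INTERACTIVE=true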