new component - upload to NAS

Remy Moll 2022-07-23 17:21:00 +02:00
parent 79e3f54955
commit 8e46f30f07
29 changed files with 132 additions and 63 deletions


@@ -1,41 +1,65 @@
# COSS_ARCHIVING

A utility to
* fetch article requests from slack
* generate pdfs for them
* compress them
* send them via slack + email
* upload them to the COSS NAS

... fully automatically. Run it now, thank me later.

---

## Running - Docker compose

The included `docker-compose` file is now necessary for easy orchestration of the various services.

All relevant passthroughs and mounts are specified through the env-file, for which I configured 4 versions:

* production
* debug (development in general)
* upload
* check

These files will have to be adapted to your individual setup but won't change significantly once set up.
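A minimal sketch of such an env-file (the variable names come from the `env/` files further down in this commit; the paths are placeholders for your own setup):

```
# sketch, e.g. env/production (adapt the paths to your machine)
# CONTAINER_DATA is mounted into the containers as /app/containerdata
CONTAINER_DATA=~/path/to/coss_archiving
HOSTS_FILE=~/path/to/coss_archiving/dependencies/hosts
DEBUG=false
CHECK=false
```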
### Overview of the modes

The production mode performs all automatic actions and therefore does not require any manual intervention. It queries the slack workspace, adds the new requests to the database, downloads all files and metadata, uploads the urls to archive.org and sends out the downloaded article. As a last step the newly created file is synced to the COSS-NAS.

The debug mode is more sophisticated and allows for big code changes without the need to recompile. It directly mounts the code-directory into the container. As a failsafe the environment-variable `DEBUG=true` is set. The whole utility is then run in a sandbox environment (slack-channel, database, email) so that Dirk is not affected by any mishaps.

The check mode is less sophisticated but shows the downloaded articles to the host for visual verification. This requires passthroughs for X11.

Upload mode is much simpler: it goes over the existing database and operates on the articles whose upload to archive.org has not yet occurred (archive.org is slow and the other operations usually finish before the queue is consumed). It retries their upload.

* For normal `production` mode run:

  `docker compose --env-file env/production up`

* For `debug` mode, you will likely want interactivity, so you need to run:

  `docker compose --env-file env/debug up -d && docker compose --env-file env/debug exec news_fetch bash && docker compose --env-file env/debug down`

  which should automatically shut down the containers once you are done (`ctrl+d` to exit the container shell). If not, re-run `docker compose --env-file env/debug down` manually.

  > Note:
  > The live-mounted code is now under `/code`. Note that the `DEBUG=true` environment variable is still set. If you want to test things on production, run `export DEBUG=false`. Running `python runner.py` will then run the newly written code, but with the production database and storage.

* For `check` mode, some env-variables are also changed and you still require interactivity. You don't need the geckodriver service, however. The simplest way is to run

  `docker compose --env-file env/check run news_fetch`

* Finally, for `upload` mode no interactivity and no additional services are required. Simply run:

  `docker compose --env-file env/upload run news_fetch`
## Building

> The software (firefox, selenium, python) changes frequently. For non-breaking changes it is useful to regularly clean build the docker image! This is also crucial to update the code itself.

In docker compose, run

`docker compose --env-file env/production build`
@@ -43,10 +67,6 @@ In docker compose, run

## Roadmap:

[_] handle paywalled sites like faz, spiegel, ... through their dedicated sites (see nexisuni.com for instance), available through the ETH network


@@ -1,12 +1,14 @@
# docker compose --env-file env/debug up

version: "3.9"

services:

  news_fetch:
    build: news_fetch
    image: news_fetch:latest
    volumes:
      - ${CONTAINER_DATA}:/app/containerdata
      - ${CODE:-/dev/null}:/code # not set in prod, defaults to /dev/null
      - ${XSOCK-/dev/null}:${XSOCK-/tmp/sock}
      - ${XAUTHORITY-/dev/null}:/home/auto_news/.Xauthority

@@ -14,6 +16,7 @@ services:
      - DISPLAY=$DISPLAY
      - TERM=xterm-256color # colored logs
      - COLUMNS=160 # for wider logs
      - DEBUG=${DEBUG}
      - CHECK=${CHECK}
      - UPLOAD=${UPLOAD}

@@ -23,8 +26,9 @@ services:
    stdin_open: ${INTERACTIVE:-false} # docker run -i
    tty: ${INTERACTIVE:-false} # docker run -t

  geckodriver:
    image: selenium/standalone-firefox:102.0.1
    volumes:
      - ${XSOCK-/dev/null}:${XSOCK-/tmp/sock}
      - ${XAUTHORITY-/dev/null}:/home/auto_news/.Xauthority

@@ -34,4 +38,34 @@ services:
      - START_XVFB=false
    user: 1001:1001
    expose: # exposed to other docker-compose services only
      - "4444"
  vpn:
    image: wazum/openconnect-proxy:latest
    env_file:
      - ${CONTAINER_DATA}/config/vpn.config
    cap_add:
      - NET_ADMIN
    volumes:
      - /dev/net/tun:/dev/net/tun
    # alternative to cap_add & volumes: specify privileged: true

  nas_sync:
    depends_on:
      - vpn # used to establish a connection to the SMB server
    network_mode: "service:vpn"
    build: nas_sync
    image: nas_sync:latest
    cap_add: # capabilities needed for mounting the SMB share
      - SYS_ADMIN
      - DAC_READ_SEARCH
    volumes:
      - ${CONTAINER_DATA}/files:/sync/local_files
      - ${CONTAINER_DATA}/config/nas_sync.config:/sync/nas_sync.config
      - ${CONTAINER_DATA}/config/nas_login.config:/sync/nas_login.config
    command:
      - nas22.ethz.ch/gess_coss_1/helbing_support/Files RM/Archiving/TEST # first command argument is the target mount path
      - lsyncd
      - /sync/nas_sync.config
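Neither of the credential files referenced above ships with the repository. `nas_login.config` is a standard `mount.cifs` credentials file, so a minimal sketch looks as follows (all values are placeholders):

```
username=<nas username>
password=<nas password>
domain=<nas domain>
```

`vpn.config` is the env-file consumed by the `wazum/openconnect-proxy` image and holds the VPN endpoint and login; see that image's documentation for the exact variable names. `nas_sync.config` is the lsyncd configuration, which presumably mirrors `/sync/local_files` onto the mounted share.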

env/check

@@ -1,7 +1,7 @@
# Does not run any downloads but displays the previously downloaded but not yet checked files. Requires display-access via xauth
CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving
HOSTS_FILE=~/Bulk/COSS/Downloads/coss_archiving/dependencies/hosts
XAUTHORITY=$XAUTHORITY
XSOCK=/tmp/.X11-unix

env/debug

@@ -1,7 +1,7 @@
# Runs in a debugging mode, does not launch anything at all but starts a bash process
CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving
HOSTS_FILE=~/Bulk/COSS/Downloads/coss_archiving/dependencies/hosts
CODE=./
XAUTHORITY=$XAUTHORITY

env/production

@@ -1,7 +1,7 @@
# Runs on the main slack channel with the full worker setup. If nothing funky has occurred, reducedfetch is a speedup
CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving
HOSTS_FILE=~/Bulk/COSS/Downloads/coss_archiving/dependencies/hosts
DEBUG=false
CHECK=false

env/upload

@@ -1,7 +1,7 @@
# Does not run any other workers and only uploads to archive the urls that weren't previously uploaded
CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving
HOSTS_FILE=~/Bulk/COSS/Downloads/coss_archiving/dependencies/hosts
DEBUG=false


@@ -10,7 +10,7 @@ from persistence import message_models
# Constant values...
MESSAGES_DB = "/app/containerdata/messages.db"

BOT_ID = "U02MR1R8UJH"
ARCHIVE_ID = "C02MM7YG1V4"

nas_sync/Dockerfile

@@ -0,0 +1,9 @@
FROM bash:latest
# alpine with bash instead of sh
RUN apk add lsyncd cifs-utils
RUN mkdir -p /sync/remote_files
COPY entrypoint.sh /sync/entrypoint.sh
ENTRYPOINT ["bash", "/sync/entrypoint.sh"]

nas_sync/entrypoint.sh

@@ -0,0 +1,10 @@
#!/bin/bash
set -e
sleep 5 # waits for the vpn to have an established connection
echo "Starting NAS sync"
mount -t cifs "//$1" -o credentials=/sync/nas_login.config /sync/remote_files
echo "Successfully mounted SAMBA remote: $1 --> /sync/remote_files"
shift # consumes the variable set in $1 so that $@ only contains the remaining arguments
exec "$@"


@@ -1,6 +1,6 @@
FROM python:latest

ENV TZ Europe/Zurich

# RUN echo "deb http://deb.debian.org/debian/ unstable main contrib non-free" >> /etc/apt/sources.list
# allows the installation of the latest firefox-release (debian is not usually a rolling release)
@@ -35,5 +35,3 @@ RUN python3 -m pip install -r /app/requirements.txt
COPY app /app/auto_news
WORKDIR /app/auto_news


@@ -1,7 +1,9 @@
from dataclasses import dataclass
import os
import shutil
import configparser
import logging
from datetime import datetime

from peewee import SqliteDatabase
from rich.logging import RichHandler
@@ -17,7 +19,7 @@ logger = logging.getLogger(__name__)
# load config file containing constants and secrets
parsed = configparser.ConfigParser()
parsed.read("/app/containerdata/config/news_fetch.config.ini")

if os.getenv("DEBUG", "false") == "true":
    logger.warning("Found 'DEBUG=true', setting up dummy databases")
@@ -28,8 +30,18 @@ if os.getenv("DEBUG", "false") == "true":
    parsed["DOWNLOADS"]["local_storage_path"] = parsed["DATABASE"]["db_path_dev"]
else:
    logger.warning("Found 'DEBUG=false' and running on production databases, I hope you know what you're doing...")

    db_base_path = parsed["DATABASE"]["db_path_prod"]

    logger.info("Backing up databases")
    backup_dst = parsed["DATABASE"]["db_backup"]
    today = datetime.today().strftime("%Y.%m.%d")
    shutil.copyfile(
        os.path.join(db_base_path, parsed["DATABASE"]["chat_db_name"]),
        os.path.join(backup_dst, today + "." + parsed["DATABASE"]["chat_db_name"]),
    )
    shutil.copyfile(
        os.path.join(db_base_path, parsed["DATABASE"]["download_db_name"]),
        os.path.join(backup_dst, today + "." + parsed["DATABASE"]["download_db_name"]),
    )
from utils_storage import models from utils_storage import models
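For reference, the backup logic above expects a `[DATABASE]` section in `news_fetch.config.ini` with roughly the following keys (the key names come from the code; the values are placeholders):

```ini
[DATABASE]
db_path_dev = /app/containerdata/debug/
db_path_prod = /app/containerdata/
db_backup = /app/containerdata/backup/
chat_db_name = messages.db
download_db_name = downloads.db
```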


@@ -14,11 +14,11 @@ from utils_worker.workers import CompressWorker, DownloadWorker, FetchWorker, Up
class ArticleWatcher:
    """Wrapper for a newly created article object. Notifies the coordinator upon change/completion"""
    def __init__(self, article, thread, **kwargs) -> None:
        self.article_id = article.id # in case article becomes None at any point, we can still track the article
        self.article = article
        self.thread = thread

        self.completition_notifier = kwargs.get("notifier")

        self.fetch = kwargs.get("worker_fetch", None)
        self.download = kwargs.get("worker_download", None)
        self.compress = kwargs.get("worker_compress", None)
@@ -95,7 +95,8 @@
        self._upload_completed = value
        self.update_status("upload")

    def __str__(self) -> str:
        return f"Article with id {self.article_id}"


class Coordinator(Thread):
@@ -154,7 +155,7 @@ class Coordinator(Thread):
        for article in articles:
            notifier = lambda article: print(f"Completed manual actions for {article}")
            ArticleWatcher(article, None, workers_manual = workers, notifier = notifier) # Article watcher wants a thread to link article to TODO: handle threads as a kwarg

    def article_complete_notifier(self, article, thread):
        if self.worker_slack is None:


@@ -158,11 +158,11 @@ def fetch_missed_channel_reactions():
                channel = config["archive_id"],
                timestamp = t.slack_ts
            )
            reactions = query.get("message", {}).get("reactions", []) # default = []
        except SlackApiError: # probably a rate_limit:
            logger.error("Hit rate limit while querying reactions. retrying in {}s ({}/{} queries elapsed)".format(config["api_wait_time"], i, len(threads)))
            time.sleep(int(config["api_wait_time"]))
            reactions = query.get("message", {}).get("reactions", [])

        for r in reactions:
            reaction_dict_to_model(r, t)


@@ -49,17 +49,3 @@ class YouTubeDownloader:
            article_object.file_name = ""

        return article_object


@@ -12,7 +12,6 @@ def upload_to_archive(article_object):
        archive_url = wayback.save()
        # logger.info(f"{url} uploaded to archive successfully")
        article_object.archive_url = archive_url
    except Exception as e:
        article_object.archive_url = "Error while uploading: {}".format(e)


@@ -48,8 +48,8 @@ class UploadWorker(TemplateWorker):
    def _handle_article(self, article_watcher):
        def action(*args, **kwargs):
            time.sleep(10) # uploads to archive are throttled to 15/minute, but 5s still triggers a blacklisting
            return run_upload(*args, **kwargs)

        super()._handle_article(article_watcher, action)
        article_watcher.upload_completed = True