Working, refactored news_fetch, better documentation for launch

This commit is contained in:
Remy Moll 2022-09-08 16:19:15 +02:00
parent 713406dc67
commit afead44d6c
14 changed files with 220 additions and 247 deletions

View File

@ -11,18 +11,15 @@ A utility to
... fully automatically. Run it now, thank me later.
---
## Running - Docker compose
The included `docker-compose` file is now necessary for easy orchestration of the various services.
## Running - through launch file
> Prerequisite: make `launch` executable:
>
> `chmod +x launch`
All relevant passthroughs and mounts are specified through the env-file, for which I configured 4 versions:
Execute the file by running `./launch`. This won't do anything in itself. You need to specify a mode and then a command.
* production
* debug (development in general)
* upload
* check
These files will have to be adapted to your individual setup but won't change significantly once set up.
`./launch <mode> <command> <command options>`
### Overview of the modes
@ -30,47 +27,67 @@ The production mode performs all automatic actions and therefore does not require
The debug mode is more sophisticated and allows for big code changes without the need to recompile. It directly mounts the code-directory into the container. As a failsafe the environment-variable `DEBUG=true` is set. The whole utility is then run in a sandbox environment (slack-channel, database, email) so that Dirk is not affected by any mishaps.
The check mode is less sophisticated but shows the downloaded articles to the host for visual verification. This requires passthroughs for X11.
Upload mode is much simpler: it goes over the existing database and operates on the articles for which the upload to archive.org has not yet occurred (archive.org is slow, and the other operations usually finish before the queue is consumed). It retries their upload.
* For normal `production` mode run:
`docker compose --env-file env/production run news_fetch`
Two additional 'modes' are `build` and `down`. Build rebuilds the container, which is necessary after code changes. Down ensures a clean shutdown of *all* containers. Usually the launch-script handles this already, but it sometimes fails, in which case `down` needs to be called manually, as sketched below.
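For reference, these are invoked like any other mode; a minimal sketch (assuming `down` takes no further arguments, just like `build`):
```bash
./launch build   # rebuild the docker image, e.g. after code changes
./launch down    # cleanly shut down all containers of the compose file
```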
* For `debug` mode run:
### Overview of the commands
`docker compose --env-file env/debug run news_fetch`
In essence, a command is simply a docker-compose service that is run in an interactive environment. As such, all services defined in `docker-compose.yaml` can be called as commands. Only two of them are of real use:
which drops you into an interactive shell (`ctrl+d` to exit the container shell).
`news_fetch` does the majority of the actions mentioned above. By default, that is without any options, it runs a metadata-fetch, download, compression, and upload to archive.org. The upload is usually the slowest step, which is why articles that are processed but don't yet have an archive.org url tend to pile up. You can therefore specify the option `upload`, which only starts the upload for the articles concerned, as a catch-up if you will.
> Note:
> The live-mounted code is now under `/code`. Note that the `DEBUG=true` environment variable is still set. If you want to test things on production, run `export DEBUG=false`. Running `python runner.py` will now run the newly written code, but with the production database and storage.
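A minimal sketch of that debug workflow inside the container shell (the paths follow the description above and may differ in your setup):
```bash
cd /code              # live-mounted source tree (skip if the shell already starts there)
export DEBUG=false    # optional: switch to the production database and storage
python runner.py      # runs the freshly edited code
```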
Example usage:
* For `check` mode, some env-variables are also changed and you still require interactivity. You don't need the geckodriver service however. The simplest way is to run
```bash
./launch production news_fetch # full mode
./launch production news_fetch upload # upload mode (lighter resource usage)
./launch debug news_fetch # debug mode, which drops you inside a new shell
`docker compose --env-file env/check run --no-deps --rm news_fetch`
./launch production news_check
```
* Finally, for `upload` mode no interactivity and no additional services are required. Simply run:
`news_check` starts a webapp, accessible at [http://localhost:8080](http://localhost:8080), that allows you to easily check the downloaded articles.
`docker compose --env-file env/upload run --no-deps --rm news_fetch`
### Stopping
Run
## (Running - Docker compose)
> I strongly recommend sticking to the usage of `./launch`.
`docker compose --env-file env/production down`
Instead of using the launch file you can manually issue `docker compose` commands, for example to check the logs.
All relevant mounts and env-variables are most easily specified through the env-file, for which I configured 2 versions:
* production
* debug (development in general)
These files will have to be adapted to your individual setup but won't change significantly once set up.
Example usage:
```bash
docker compose --env-file env/production run news_fetch # full mode
docker compose --env-file env/production run news_fetch upload # upload mode (lighter resource usage)
docker compose --env-file env/debug run news_fetch # debug mode, which drops you inside a new shell
docker compose --env-file env/production run news_check
# Misc:
docker compose --env-file env/production up # starts all services and shows their combined logs
docker compose --env-file env/production logs -f news_fetch # follows along with the logs of only one service
docker compose --env-file env/production down
```
which terminates all containers associated with the `docker-compose.yaml`.
## Building
> The software (firefox, selenium, python) changes frequently. For non-breaking changes it is useful to regularly clean build the docker image! This is also crucial to update the code itself.
> The software (firefox, selenium, python) changes frequently. For non-breaking changes it is useful to regularly rebuild the docker image! This is also crucial to update the code itself.
In docker compose, run
`docker compose --env-file env/production build`
Or simpler, just run
`./launch build`
@ -80,6 +97,10 @@ In docker compose, run
## Manual Sync to NAS:
Manual sync is sadly still necessary, as the lsync client sometimes gets overwhelmed by quick writes.
I use `rsync`. Mounting the NAS locally, I navigate to the location of the local folder (notice the trailing slash). Then run
`rsync -Razq --no-perms --no-owner --no-group --temp-dir=/tmp --progress --log-file=rsync.log <local folder>/ "<remote>"`
where `<remote>` is the location where the NAS is mounted. (Options: `R` - relative paths, `a` - archive mode (a shorthand for several preservation flags), `z` - compress data during transfer, `q` - quiet. We also skip most of the ownership metadata and keep a log of the transfers.)
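A concrete invocation might look like this; both paths are placeholders for illustration and need to be adapted to your local folder and NAS mount point:
```bash
rsync -Razq --no-perms --no-owner --no-group --temp-dir=/tmp --progress \
    --log-file=rsync.log coss_archiving/ "/mnt/nas/coss_archiving"
```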
You can also use your OS' native copy option and select *do not overwrite*. This should only copy the missing files, significantly speeding up the operation.

View File

@ -38,6 +38,7 @@ services:
environment:
- START_VNC=${HEADFULL-false} # as opposed to headless, used when requiring supervision (eg. for websites that crash)
- START_XVFB=${HEADFULL-false}
- SE_VNC_NO_PASSWORD=1
expose: ["4444"] # exposed to other docker-compose services only
ports:
- 7900:7900 # port for webvnc

15
env/check vendored
View File

@ -1,15 +0,0 @@
# Does not run any downloads but displays the previously downloaded but not yet checked files. Requires display-access via xauth
CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving
XAUTHORTIY=$XAUTHORTIY
XSOCK=/tmp/.X11-unix
DEBUG=false
CHECK=true
HEADLESS=true
UPLOAD=false
REDUCEDFETCH=false
# ENTRYPOINT="/bin/bash"
INTERACTIVE=true

18
env/debug vendored
View File

@ -1,14 +1,10 @@
# Runs in a debugging mode, does not launch anything at all but starts a bash process
CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving
export CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving
export UNAME=remy
CODE=./
DEBUG=true
CHECK=false
UPLOAD=false
HEADLESS=false
REDUCEDFETCH=false
ENTRYPOINT="/bin/bash"
INTERACTIVE=true
export GECKODRIVER_IMG=selenium/standalone-firefox:104.0
export DEBUG=true
export HEADFULL=true
export CODE=./
export ENTRYPOINT=/bin/bash

9
env/production vendored
View File

@ -2,9 +2,6 @@
CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving
CONTAINERS_TO_RUN=nas_sync, geckodriver
DEBUG=false
CHECK=false
UPLOAD=false
HEADLESS=true
REDUCEDFETCH=true
export UNAME=remy
export GECKODRIVER_IMG=selenium/standalone-firefox:104.0
export DEBUG=false

10
env/upload vendored
View File

@ -1,10 +0,0 @@
# Does not run any other workers and only uploads to archive.org the urls that weren't previously uploaded
CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving
NEWS_FETCH_DEPENDS_ON="[]"
DEBUG=false
CHECK=false
UPLOAD=true
HEADLESS=true
REDUCEDFETCH=false

2
launch
View File

@ -9,7 +9,7 @@ echo "Bash script launching COSS_ARCHIVING..."
export CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving
export UNAME=remy
# CHANGE ME WHEN UPDATING FIREFOX
export GECKODRIVER_IMG=selenium/standalone-firefox:103.0
export GECKODRIVER_IMG=selenium/standalone-firefox:104.0
# version must be >= the one on the host or firefox will not start (because of mismatched config)
if [[ $1 == "debug" ]]

View File

@ -25,7 +25,7 @@
<td>{ item.name }</td>
<!-- <td>Quality Control Specialist</td> -->
{#if item.value != ""}
<td class='bg-emerald-200' style="white-space: normal">{ item.value }</td>
<td class='bg-emerald-200' style="white-space: normal; width:70%">{ item.value }</td>
{:else}
<td class='bg-red-200'>{ item.value }</td>
{/if}

View File

@ -2,6 +2,8 @@ FROM python:latest
ENV TZ Europe/Zurich
RUN apt-get update && apt-get install -y ghostscript
# for compression of pdfs
RUN useradd --create-home --shell /bin/bash --uid 1001 autonews
# id mapped to local user

View File

@ -1,7 +1,8 @@
import os
import shutil
import configparser
import logging
import time
import shutil
from datetime import datetime
from peewee import SqliteDatabase, PostgresqlDatabase
from rich.logging import RichHandler
@ -41,6 +42,7 @@ if os.getenv("DEBUG", "false") == "true":
else:
logger.warning("Found 'DEBUG=false' and running on production databases, I hope you know what you're doing...")
time.sleep(10) # wait for the vpn to connect (can't use a healthcheck because there is no depends_on)
cred = db_config["DATABASE"]
download_db = PostgresqlDatabase(
cred["db_name"], user=cred["user_name"], password=cred["password"], host="vpn", port=5432

View File

@ -3,125 +3,91 @@ import configuration
models = configuration.models
from threading import Thread
import logging
import sys
logger = logging.getLogger(__name__)
import sys
from collections import OrderedDict
from utils_mail import runner as mail_runner
from utils_slack import runner as slack_runner
from utils_mail import runner as MailRunner
from utils_slack import runner as SlackRunner
from utils_worker.workers import CompressWorker, DownloadWorker, FetchWorker, UploadWorker
class ArticleWatcher:
"""Wrapper for a newly created article object. Notifies the coordinator upon change/completition"""
def __init__(self, article, **kwargs) -> None:
self.article_id = article.id # in case article becomes None at any point, we can still track the article
def __init__(self, article, workers_in, workers_out) -> None:
self.article = article
self.completition_notifier = kwargs.get("notifier")
self.fetch = kwargs.get("worker_fetch", None)
self.download = kwargs.get("worker_download", None)
self.compress = kwargs.get("worker_compress", None)
self.upload = kwargs.get("worker_upload", None)
self.workers_in = workers_in
self.workers_out = workers_out
self.completition_notified = False
# self._download_called = self._compression_called = False
self._fetch_completed = self._download_completed = self._compression_completed = self._upload_completed = False
# first step: gather metadata
if self.fetch and self.upload:
self.fetch.process(self) # this will call the update_status method
self.upload.process(self) # idependent from the rest
else: # the full kwargs were not provided, only do a manual run
# overwrite update_status() because calls from the workers will result in erros
self.update_status = lambda completed: logger.info(f"Completed action {completed}")
for w in kwargs.get("workers_manual"):
w.process(self)
for w_dict in self.workers_in:
worker = self.get_next_worker(w_dict) # gets the first worker of each dict (they get processed independently)
worker.process(self)
def update_status(self, completed_action):
"""Checks and notifies internal completition-status.
Article download is complete iff fetch and download were successfull and compression was run
"""
# if self.completition_notified and self._compression_completed and self._fetch_completed and self._download_completed and self._upload_completed, we are done
if completed_action == "fetch":
self.download.process(self)
elif completed_action == "download":
self.compress.process(self)
elif completed_action == "compress": # last step
self.completition_notifier(self.article)
# triggers action in Coordinator
elif completed_action == "upload":
# this case occurs when upload was faster than compression
pass
else:
logger.warning(f"update_status called with unusual configuration: {completed_action}")
def get_next_worker(self, worker_dict, worker_name=""):
"""Returns the worker coming after the one with key worker_name"""
if worker_name == "": # first one
return worker_dict[list(worker_dict.keys())[0]]
# for i,w_dict in enumerate(workers_list):
keys = list(worker_dict.keys())
next_key_ind = keys.index(worker_name) + 1
try:
key = keys[next_key_ind]
return worker_dict[key]
except IndexError:
return None
# ====== Attributes to be modified by the util workers
@property
def fetch_completed(self):
return self._fetch_completed
def update(self, worker_name):
"""Called by the workers to notify the watcher of a completed step"""
for w_dict in self.workers_in:
if worker_name in w_dict.keys():
next_worker = self.get_next_worker(w_dict, worker_name)
if next_worker:
if next_worker == "out":
self.completion_notifier()
else: # it's just another in-worker
next_worker.process(self)
else: # no next worker, we are done
logger.info(f"No worker after {worker_name}")
@fetch_completed.setter
def fetch_completed(self, value: bool):
self._fetch_completed = value
self.update_status("fetch")
@property
def download_completed(self):
return self._download_completed
def completion_notifier(self):
"""Triggers the out-workers to process the article, that is to send out a message"""
for w_dict in self.workers_out:
worker = self.get_next_worker(w_dict)
worker.send(self.article)
self.article.sent = True
self.article.save()
@download_completed.setter
def download_completed(self, value: bool):
self._download_completed = value
self.update_status("download")
@property
def compression_completed(self):
return self._compression_completed
@compression_completed.setter
def compression_completed(self, value: bool):
self._compression_completed = value
self.update_status("compress")
@property
def upload_completed(self):
return self._upload_completed
@upload_completed.setter
def upload_completed(self, value: bool):
self._upload_completed = value
self.update_status("upload")
def __str__(self) -> str:
return f"Article with id {self.article_id}"
return f"ArticleWatcher with id {self.article_id}"
class Coordinator(Thread):
def __init__(self, **kwargs) -> None:
"""Launcher calls this Coordinator as the main thread to handle connections between the other workers (threaded)."""
super().__init__(target = self.launch, daemon=True)
def add_workers(self, **kwargs):
self.worker_slack = kwargs.pop("worker_slack", None)
self.worker_mail = kwargs.pop("worker_mail", None)
# the two above won't be needed in the Watcher
self.worker_download = kwargs.get("worker_download", None)
self.worker_fetch = kwargs.get("worker_fetch", None)
self.worker_compress = kwargs.get("worker_compress", None)
self.worker_upload = kwargs.get("worker_upload", None)
class Dispatcher(Thread):
def __init__(self) -> None:
"""Thread to handle handle incoming requests and control the workers"""
self.workers_in = []
self.workers_out = []
super().__init__(target = self.launch)
self.kwargs = kwargs
def launch(self) -> None:
for w in [self.worker_download, self.worker_fetch, self.worker_upload, self.worker_compress]:
if not w is None: # for reduced operations such as upload, some workers are set to None
w.start()
# start workers (each worker is a thread)
for w_dict in self.workers_in: # for reduced operations such as upload, some workers are not set
for w in w_dict.values():
if isinstance(w, Thread):
w.start()
# if past messages have not been sent, they must be reevaluated
unsent = models.ArticleDownload.filter(sent = False)
# .objects.filter(sent = False)
# get all articles not fully processed
unsent = models.ArticleDownload.filter(sent = False) # if past messages have not been sent, they must be reevaluated
for a in unsent:
self.incoming_request(article=a)
@ -136,82 +102,82 @@ class Coordinator(Thread):
return
article, is_new = models.ArticleDownload.get_or_create(article_url=url)
article.slack_ts = message.ts # either update the timestamp (to the last reference to the article) or set it for the first time
article.save()
elif article is not None:
is_new = False
logger.info(f"Received article {article} in incoming_request")
else:
logger.error("Coordinator.incoming_request called with no arguments")
logger.error("Dispatcher.incoming_request called with no arguments")
return
self.kwargs.update({"notifier" : self.article_complete_notifier})
if is_new or (article.file_name == "" and article.verified == 0):
# check for models that were created but were abandoned. This means they have missing information, most importantly no associated file
# this overwrites previously set information, but that should not be too important
ArticleWatcher(
article,
**self.kwargs
workers_in=self.workers_in,
workers_out=self.workers_out,
)
# All workers are implemented as a threaded queue. But the individual model requires a specific processing order:
# fetch -> download -> compress -> complete
# the watcher orchestrates the procedure and notifies upon completition
# the watcher will notify once it is sufficiently populated
else: # manually trigger notification immediately
logger.info(f"Found existing article {article}. Now sending")
self.article_complete_notifier(article)
def manual_processing(self, articles, workers):
for w in workers:
w.start()
# def manual_processing(self, articles, workers):
# for w in workers:
# w.start()
for article in articles:
notifier = lambda article: logger.info(f"Completed manual actions for {article}")
ArticleWatcher(article, workers_manual = workers, notifier = notifier) # Article watcher wants a thread to link article to TODO: handle threads as a kwarg
def article_complete_notifier(self, article):
if self.worker_slack is None:
logger.warning("Skipping slack notification because worker is None")
else:
self.worker_slack.bot_worker.respond_channel_message(article)
if self.worker_mail is None:
logger.warning("Skipping mail notification because worker is None")
else:
self.worker_mail.send(article)
article.sent = True
article.save()
# for article in articles:
# notifier = lambda article: logger.info(f"Completed manual actions for {article}")
# ArticleWatcher(article, workers_manual = workers, notifier = notifier) # Article watcher wants a thread to link article to TODO: handle threads as a kwarg
if __name__ == "__main__":
coordinator = Coordinator()
dispatcher = Dispatcher()
if "upload" in sys.argv:
class PrintWorker:
def send(self, article):
print(f"Uploaded article {article}")
articles = models.ArticleDownload.select().where(models.ArticleDownload.archive_url == "" or models.ArticleDownload.archive_url == "TODO:UPLOAD").execute()
logger.info(f"Launching upload to archive for {len(articles)} articles.")
coordinator.manual_processing(articles, [UploadWorker()])
dispatcher.workers_in = [{"UploadWorker": UploadWorker()}]
dispatcher.workers_out = [{"PrintWorker": PrintWorker()}]
dispatcher.start()
else: # launch with full action
slack_runner = slack_runner.BotRunner(coordinator.incoming_request)
kwargs = {
"worker_download" : DownloadWorker(),
"worker_fetch" : FetchWorker(),
"worker_upload" : UploadWorker(),
"worker_compress" : CompressWorker(),
"worker_slack" : slack_runner,
"worker_mail" : mail_runner,
}
try:
coordinator.add_workers(**kwargs)
coordinator.start()
slack_runner = SlackRunner.BotRunner(dispatcher.incoming_request)
# All workers are implemented as a threaded queue. But the individual model requires a specific processing order:
# fetch -> download -> compress -> complete
# This is reflected in the following list of workers:
workers_in = [
OrderedDict({"FetchWorker": FetchWorker(), "DownloadWorker": DownloadWorker(), "CompressWorker": CompressWorker(), "NotifyRunner": "out"}),
OrderedDict({"UploadWorker": UploadWorker()})
]
# The two dicts are processed independently. First element of first dict is called at the same time as the first element of the second dict
# Inside a dict, the order of the keys gives the order of execution (only when the first element is done, the second is called, etc...)
workers_out = [{"SlackRunner": slack_runner},{"MailRunner": MailRunner}]
dispatcher.workers_in = workers_in
dispatcher.workers_out = workers_out
dispatcher.start() # starts the thread, (ie. runs launch())
slack_runner.start() # last one to start, inside the main thread
except KeyboardInterrupt:
logger.info("Keyboard interrupt. Stopping Slack and Coordinator")
logger.info("Keyboard interrupt. Stopping Slack and dispatcher")
slack_runner.stop()
logger.info("BYE!")
# coordinator was set as a daemon thread, so it will be stopped automatically
dispatcher.join()
for w_dict in workers_in:
for w in w_dict.values():
if isinstance(w, Thread):
w.stop()
# All threads are launched as a daemon thread, meaning that any 'leftover' should exit along with the sys call
sys.exit(0)

View File

@ -157,29 +157,34 @@ class BotApp(App):
if say is None:
say = self.say_substitute
answers = article.slack_info
for a in answers:
if a["file_path"]:
try:
self.client.files_upload(
channels = config["archive_id"],
initial_comment = f"{a['reply_text']}",
file = a["file_path"],
thread_ts = article.slack_ts_full
)
status = True
except SlackApiError as e: # upload resulted in an error
if article.slack_ts == 0:
self.logger.error(f"{article} has no slack_ts")
else:
self.logger.info("Skipping slack reply because it is broken")
for a in []:
# for a in answers:
if a["file_path"]:
try:
self.client.files_upload(
channels = config["archive_id"],
initial_comment = f"{a['reply_text']}",
file = a["file_path"],
thread_ts = article.slack_ts_full
)
# status = True
except SlackApiError as e: # upload resulted in an error
say(
"File {} could not be uploaded.".format(a),
thread_ts = article.slack_ts_full
)
# status = False
self.logger.error(f"File upload failed: {e}")
else: # anticipated that there is no file!
say(
"File {} could not be uploaded.".format(a),
f"{a['reply_text']}",
thread_ts = article.slack_ts_full
)
status = False
self.logger.error(f"File upload failed: {e}")
else: # anticipated that there is no file!
say(
f"{a['reply_text']}",
thread_ts = article.slack_ts_full
)
status = True
# status = True
def startup_status(self):
@ -230,6 +235,9 @@ class BotRunner():
self.logger.info("Closed Slack-Socketmodehandler")
def send(self, article):
"""Proxy function to send a message to the slack channel, Called by ArticleWatcher once the Article is ready"""
self.bot_worker.respond_channel_message(article)

View File

@ -7,12 +7,10 @@ class TemplateWorker(Thread):
"""Parent class for any subsequent worker of the article-download pipeline. They should all run in parallel, thus the Thread subclassing"""
logger = logging.getLogger(__name__)
def __init__(self, *args, **kwargs) -> None:
def __init__(self, **kwargs) -> None:
target = self._queue_processor # will be executed on Worker.start()
group = kwargs.get("group", None)
name = kwargs.get("name", None)
super().__init__(group=group, target=target, name=name)
self.keep_running = True
super().__init__(target=target, daemon=True)
self._article_queue = []
self.logger.info(f"Worker thread {self.__class__.__name__} initialized successfully")
@ -23,7 +21,7 @@ class TemplateWorker(Thread):
def _queue_processor(self):
"""This method is launched by thread.run() and idles when self._article_queue is empty. When an external caller appends to the queue it jumps into action"""
while True: # PLEASE tell me if I'm missing an obvious better way of doing this!
while self.keep_running: # PLEASE tell me if I'm missing an obvious better way of doing this!
if len(self._article_queue) == 0:
time.sleep(5)
else:
@ -39,3 +37,10 @@ class TemplateWorker(Thread):
article = article_watcher.article
article = action(article) # action updates the article object but does not save the change
article.save()
article_watcher.update(self.__class__.__name__)
def stop(self):
self.logger.info(f"Stopping worker {self.__class__.__name__} whith {len(self._article_queue)} articles left in queue")
self.keep_running = False
self.join()

View File

@ -25,7 +25,7 @@ class DownloadWorker(TemplateWorker):
action = self.dl_runner
super()._handle_article(article_watcher, action)
article_watcher.download_completed = True
# article_watcher.download_completed = True
@ -36,7 +36,7 @@ class FetchWorker(TemplateWorker):
def _handle_article(self, article_watcher):
action = get_description # function
super()._handle_article(article_watcher, action)
article_watcher.fetch_completed = True
# article_watcher.fetch_completed = True
@ -52,7 +52,7 @@ class UploadWorker(TemplateWorker):
return run_upload(*args, **kwargs)
super()._handle_article(article_watcher, action)
article_watcher.upload_completed = True
# article_watcher.upload_completed = True
@ -63,4 +63,4 @@ class CompressWorker(TemplateWorker):
def _handle_article(self, article_watcher):
action = shrink_pdf
super()._handle_article(article_watcher, action)
article_watcher.compression_completed = True
# article_watcher.compression_completed = True