Working, refactored news_fetch, better documentation for launch
Parent: 713406dc67 · Commit: afead44d6c

README.md · 81 changes
@@ -11,18 +11,15 @@ A utility to
 ... fully automatically. Run it now, thank me later.

 ---

-## Running - Docker compose
-
-The included `docker-compose` file is now necessary for easy orchestration of the various services.
-
-All relevant passthroughs and mounts are specified through the env-file, for which I configured 4 versions:
-
-* production
-* debug (development in general)
-* upload
-* check
-
-These files will have to be adapted to your individual setup but won't change significantly once set up.
+## Running - through launch file
+
+> Prerequisite: make `launch` executable:
+>
+> `chmod +x launch`
+
+Execute the file by running `./launch`. This won't do anything by itself: you need to specify a mode and then a command.
+
+`./launch <mode> <command> <command options>`

 ### Overview of the modes
@@ -30,47 +27,67 @@ The production mode performs all automatic actions and therefore does not require

 The debug mode is more sophisticated and allows for big code changes without the need to recompile. It directly mounts the code-directory into the container. As a failsafe the environment-variable `DEBUG=true` is set. The whole utility is then run on a sandbox environment (slack-channel, database, email) so that Dirk is not affected by any mishaps.

-The check mode is less sophisticated but shows the downloaded articles to the host for visual verification. This requires passthroughs for X11.
-
-Upload mode is much simpler: it goes over the existing database and operates on the articles where the upload to archive.org has not yet occurred (archive.org is slow and the other operations usually finish before the queue was consumed). It retries their upload.
-
-* For normal `production` mode run:
-
-`docker compose --env-file env/production run news_fetch`
-
-* For `debug` mode run:
-
-`docker compose --env-file env/debug run news_fetch`
-
-which drops you into an interactive shell (`ctrl+d` to exit the container shell).
-
-> Note:
-> The live-mounted code is now under `/code`. Note that the `DEBUG=true` environment variable is still set. If you want to test things on production, run `export DEBUG=false`. Running `python runner.py` will now run the newly written code, but with the production database and storage.
-
-* For `check` mode, some env-variables are also changed and you still require interactivity. You don't need the geckodriver service however. The simplest way is to run
-
-`docker compose --env-file env/check run --no-deps --rm news_fetch`
-
-* Finally, for `upload` mode no interactivity is required and no additional services are required. Simply run:
-
-`docker compose --env-file env/upload run --no-deps --rm news_fetch`
-
-### Stopping
-
-Run
-
-`docker compose --env-file env/production down`
-
-which terminates all containers associated with the `docker-compose.yaml`.
+Two additional 'modes' are `build` and `down`. Build rebuilds the container, which is necessary after code changes. Down ensures a clean shutdown of *all* containers. Usually the launch-script handles this already, but it sometimes fails, in which case `down` needs to be called again.
+
+### Overview of the commands
+
+In essence a command is simply a service from docker-compose, which is run in an interactive environment. As such all services defined in `docker-compose.yaml` can be called as commands. Only two of them will be of real use:
+
+`news_fetch` does the majority of the actions mentioned above. By default, that is without any options, it runs a metadata-fetch, download, compression, and upload to archive.org. The upload is usually the slowest, which is why articles that are processed but don't yet have an archive.org url tend to pile up. You can therefore specify the option `upload`, which only starts the upload for the concerned articles, as a catch-up if you will.
+
+Example usage:
+
+```bash
+./launch production news_fetch # full mode
+./launch production news_fetch upload # upload mode (lighter resource usage)
+./launch debug news_fetch # debug mode, which drops you inside a new shell
+./launch production news_check
+```
+
+`news_check` starts a webapp, accessible under [http://localhost:8080](http://localhost:8080), and allows you to easily check the downloaded articles.
+
+## (Running - Docker compose)
+
+> I strongly recommend sticking to the usage of `./launch`.
+
+Instead of using the launch file you can manually issue `docker compose` commands. Example: check for logs.
+
+All relevant mounts and env-variables are easiest specified through the env-file, for which I configured 2 versions:
+
+* production
+* debug (development in general)
+
+These files will have to be adapted to your individual setup but won't change significantly once set up.
+
+Example usage:
+
+```bash
+docker compose --env-file env/production run news_fetch # full mode
+docker compose --env-file env/production run news_fetch upload # upload mode (lighter resource usage)
+docker compose --env-file env/debug run news_fetch # debug mode, which drops you inside a new shell
+docker compose --env-file env/production run news_check
+
+# Misc:
+docker compose --env-file env/production up # starts all services and shows their combined logs
+docker compose --env-file env/production logs -f news_fetch # follows along with the logs of only one service
+docker compose --env-file env/production down
+```

 ## Building

-> The software (firefox, selenium, python) changes frequently. For non-breaking changes it is useful to regularly clean build the docker image! This is also crucial to update the code itself.
+> The software (firefox, selenium, python) changes frequently. For non-breaking changes it is useful to regularly rebuild the docker image! This is also crucial to update the code itself.

 In docker compose, run

 `docker compose --env-file env/production build`

+Or simpler, just run
+
+`./launch build`
@@ -80,6 +97,10 @@ In docker compose, run

 ## Manual Sync to NAS:

+Manual sync is sadly still necessary, as the lsync client sometimes gets overwhelmed by quick writes.
+
 I use `rsync`. Mounting the NAS locally, I navigate to the location of the local folder (notice the trailing slash). Then run

 `rsync -Razq --no-perms --no-owner --no-group --temp-dir=/tmp --progress --log-file=rsync.log <local folder>/ "<remote>"`

 where `<remote>` is the location where the NAS is mounted. (options: `R` - relative paths, `a` - archive mode (multiple actions), `z` - compress during transfer, `q` - quiet. We also don't copy most of the metadata and we keep a log of the transfers.)

+You can also use your OS' native copy option and select *do not overwrite*. This should only copy the missing files, significantly speeding up the operation.
@@ -38,6 +38,7 @@ services:
     environment:
       - START_VNC=${HEADFULL-false} # as opposed to headless, used when requiring supervision (eg. for websites that crash)
      - START_XVFB=${HEADFULL-false}
+      - SE_VNC_NO_PASSWORD=1
     expose: ["4444"] # exposed to other docker-compose services only
     ports:
      - 7900:7900 # port for webvnc
env/check (vendored) · 15 lines removed

@@ -1,15 +0,0 @@
-# Does not run any downloads but displays the previously downloaded but not yet checked files. Requires display-acces via xauth
-
-CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving
-
-XAUTHORTIY=$XAUTHORTIY
-XSOCK=/tmp/.X11-unix
-
-DEBUG=false
-CHECK=true
-HEADLESS=true
-UPLOAD=false
-REDUCEDFETCH=false
-
-# ENTRYPOINT="/bin/bash"
-INTERACTIVE=true
env/debug (vendored) · 18 changes

@@ -1,14 +1,10 @@
 # Runs in a debugging mode, does not launch anything at all but starts a bash process

-CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving
-
-CODE=./
-DEBUG=true
-CHECK=false
-UPLOAD=false
-HEADLESS=false
-REDUCEDFETCH=false
-
-ENTRYPOINT="/bin/bash"
-INTERACTIVE=true
+export CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving
+export UNAME=remy
+
+export GECKODRIVER_IMG=selenium/standalone-firefox:104.0
+export DEBUG=true
+export HEADFULL=true
+export CODE=./
+export ENTRYPOINT=/bin/bash
env/production (vendored) · 9 changes

@@ -2,9 +2,6 @@

 CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving

-CONTAINERS_TO_RUN=nas_sync, geckodriver
-DEBUG=false
-CHECK=false
-UPLOAD=false
-HEADLESS=true
-REDUCEDFETCH=true
+export UNAME=remy
+export GECKODRIVER_IMG=selenium/standalone-firefox:104.0
+export DEBUG=false
env/upload (vendored) · 10 lines removed

@@ -1,10 +0,0 @@
-# Does not run any other workers and only upploads to archive the urls that weren't previously uploaded
-
-CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving
-
-NEWS_FETCH_DEPENDS_ON="[]"
-DEBUG=false
-CHECK=false
-UPLOAD=true
-HEADLESS=true
-REDUCEDFETCH=false
launch · 2 changes

@@ -9,7 +9,7 @@ echo "Bash script launching COSS_ARCHIVING..."
 export CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving
 export UNAME=remy
 # CHANGE ME WHEN UPDATING FIREFOX
-export GECKODRIVER_IMG=selenium/standalone-firefox:103.0
+export GECKODRIVER_IMG=selenium/standalone-firefox:104.0
 # version must be >= the one on the host or firefox will not start (because of mismatched config)

 if [[ $1 == "debug" ]]
@@ -25,7 +25,7 @@
 <td>{ item.name }</td>
 <!-- <td>Quality Control Specialist</td> -->
 {#if item.value != ""}
-<td class='bg-emerald-200' style="white-space: normal">{ item.value }</td>
+<td class='bg-emerald-200' style="white-space: normal; width:70%">{ item.value }</td>
 {:else}
 <td class='bg-red-200'>{ item.value }</td>
 {/if}
@@ -2,6 +2,8 @@ FROM python:latest

 ENV TZ Europe/Zurich

+RUN apt-get update && apt-get install -y ghostscript
+# for compression of pdfs

 RUN useradd --create-home --shell /bin/bash --uid 1001 autonews
 # id mapped to local user
@@ -1,7 +1,8 @@
 import os
-import shutil
 import configparser
 import logging
+import time
+import shutil
 from datetime import datetime
 from peewee import SqliteDatabase, PostgresqlDatabase
 from rich.logging import RichHandler
@@ -41,6 +42,7 @@ if os.getenv("DEBUG", "false") == "true":
 else:
     logger.warning("Found 'DEBUG=false' and running on production databases, I hope you know what you're doing...")

+    time.sleep(10) # wait for the vpn to connect (can't use a healthcheck because there is no depends_on)
     cred = db_config["DATABASE"]
     download_db = PostgresqlDatabase(
         cred["db_name"], user=cred["user_name"], password=cred["password"], host="vpn", port=5432
@@ -3,125 +3,91 @@ import configuration
 models = configuration.models
 from threading import Thread
 import logging
-import sys
 logger = logging.getLogger(__name__)
+import sys
+from collections import OrderedDict

-from utils_mail import runner as mail_runner
-from utils_slack import runner as slack_runner
+from utils_mail import runner as MailRunner
+from utils_slack import runner as SlackRunner
 from utils_worker.workers import CompressWorker, DownloadWorker, FetchWorker, UploadWorker


 class ArticleWatcher:
     """Wrapper for a newly created article object. Notifies the coordinator upon change/completition"""
-    def __init__(self, article, **kwargs) -> None:
-        self.article_id = article.id # in case article becomes None at any point, we can still track the article
+    def __init__(self, article, workers_in, workers_out) -> None:
         self.article = article

-        self.completition_notifier = kwargs.get("notifier")
-        self.fetch = kwargs.get("worker_fetch", None)
-        self.download = kwargs.get("worker_download", None)
-        self.compress = kwargs.get("worker_compress", None)
-        self.upload = kwargs.get("worker_upload", None)
+        self.workers_in = workers_in
+        self.workers_out = workers_out

         self.completition_notified = False
-        # self._download_called = self._compression_called = False
-        self._fetch_completed = self._download_completed = self._compression_completed = self._upload_completed = False

-        # first step: gather metadata
-        if self.fetch and self.upload:
-            self.fetch.process(self) # this will call the update_status method
-            self.upload.process(self) # idependent from the rest
-        else: # the full kwargs were not provided, only do a manual run
-            # overwrite update_status() because calls from the workers will result in erros
-            self.update_status = lambda completed: logger.info(f"Completed action {completed}")
-            for w in kwargs.get("workers_manual"):
-                w.process(self)
+        for w_dict in self.workers_in:
+            worker = self.get_next_worker(w_dict) # gets the first worker of each dict (they get processed independently)
+            worker.process(self)


-    def update_status(self, completed_action):
-        """Checks and notifies internal completition-status.
-        Article download is complete iff fetch and download were successfull and compression was run
-        """
-        # if self.completition_notified and self._compression_completed and self._fetch_completed and self._download_completed and self._upload_completed, we are done
-        if completed_action == "fetch":
-            self.download.process(self)
-        elif completed_action == "download":
-            self.compress.process(self)
-        elif completed_action == "compress": # last step
-            self.completition_notifier(self.article)
-            # triggers action in Coordinator
-        elif completed_action == "upload":
-            # this case occurs when upload was faster than compression
-            pass
-        else:
-            logger.warning(f"update_status called with unusual configuration: {completed_action}")
+    def get_next_worker(self, worker_dict, worker_name=""):
+        """Returns the worker coming after the one with key worker_name"""
+        if worker_name == "": # first one
+            return worker_dict[list(worker_dict.keys())[0]]
+        # for i,w_dict in enumerate(workers_list):
+        keys = list(worker_dict.keys())
+        next_key_ind = keys.index(worker_name) + 1
+        try:
+            key = keys[next_key_ind]
+            return worker_dict[key]
+        except IndexError:
+            return None


-    # ====== Attributes to be modified by the util workers
-    @property
-    def fetch_completed(self):
-        return self._fetch_completed
-
-    @fetch_completed.setter
-    def fetch_completed(self, value: bool):
-        self._fetch_completed = value
-        self.update_status("fetch")
-
-    @property
-    def download_completed(self):
-        return self._download_completed
-
-    @download_completed.setter
-    def download_completed(self, value: bool):
-        self._download_completed = value
-        self.update_status("download")
-
-    @property
-    def compression_completed(self):
-        return self._compression_completed
-
-    @compression_completed.setter
-    def compression_completed(self, value: bool):
-        self._compression_completed = value
-        self.update_status("compress")
-
-    @property
-    def upload_completed(self):
-        return self._upload_completed
-
-    @upload_completed.setter
-    def upload_completed(self, value: bool):
-        self._upload_completed = value
-        self.update_status("upload")
+    def update(self, worker_name):
+        """Called by the workers to notify the watcher of a completed step"""
+        for w_dict in self.workers_in:
+            if worker_name in w_dict.keys():
+                next_worker = self.get_next_worker(w_dict, worker_name)
+                if next_worker:
+                    if next_worker == "out":
+                        self.completion_notifier()
+                    else: # it's just another in-worker
+                        next_worker.process(self)
+                else: # no next worker, we are done
+                    logger.info(f"No worker after {worker_name}")
+
+
+    def completion_notifier(self):
+        """Triggers the out-workers to process the article, that is to send out a message"""
+        for w_dict in self.workers_out:
+            worker = self.get_next_worker(w_dict)
+            worker.send(self.article)
+            self.article.sent = True
+            self.article.save()


     def __str__(self) -> str:
-        return f"Article with id {self.article_id}"
+        return f"ArticleWatcher with id {self.article_id}"


-class Coordinator(Thread):
-    def __init__(self, **kwargs) -> None:
-        """Launcher calls this Coordinator as the main thread to handle connections between the other workers (threaded)."""
-        super().__init__(target = self.launch, daemon=True)
-
-    def add_workers(self, **kwargs):
-        self.worker_slack = kwargs.pop("worker_slack", None)
-        self.worker_mail = kwargs.pop("worker_mail", None)
-        # the two above won't be needed in the Watcher
-        self.worker_download = kwargs.get("worker_download", None)
-        self.worker_fetch = kwargs.get("worker_fetch", None)
-        self.worker_compress = kwargs.get("worker_compress", None)
-        self.worker_upload = kwargs.get("worker_upload", None)
-
-        self.kwargs = kwargs
+class Dispatcher(Thread):
+    def __init__(self) -> None:
+        """Thread to handle handle incoming requests and control the workers"""
+        self.workers_in = []
+        self.workers_out = []
+        super().__init__(target = self.launch)

     def launch(self) -> None:
-        for w in [self.worker_download, self.worker_fetch, self.worker_upload, self.worker_compress]:
-            if not w is None: # for reduced operations such as upload, some workers are set to None
+        # start workers (each worker is a thread)
+        for w_dict in self.workers_in: # for reduced operations such as upload, some workers are not set
+            for w in w_dict.values():
+                if isinstance(w, Thread):
                     w.start()

-        # if past messages have not been sent, they must be reevaluated
-        unsent = models.ArticleDownload.filter(sent = False)
-        # .objects.filter(sent = False)
+        # get all articles not fully processed
+        unsent = models.ArticleDownload.filter(sent = False) # if past messages have not been sent, they must be reevaluated
        for a in unsent:
            self.incoming_request(article=a)
@@ -136,82 +102,82 @@ class Coordinator(Thread):
             return
         article, is_new = models.ArticleDownload.get_or_create(article_url=url)
         article.slack_ts = message.ts # either update the timestamp (to the last reference to the article) or set it for the first time
+        article.save()
     elif article is not None:
         is_new = False
         logger.info(f"Received article {article} in incoming_request")
     else:
-        logger.error("Coordinator.incoming_request called with no arguments")
+        logger.error("Dispatcher.incoming_request called with no arguments")
         return

-    self.kwargs.update({"notifier" : self.article_complete_notifier})
-
     if is_new or (article.file_name == "" and article.verified == 0):
         # check for models that were created but were abandonned. This means they have missing information, most importantly no associated file
         # this overwrites previously set information, but that should not be too important
         ArticleWatcher(
             article,
-            **self.kwargs
+            workers_in=self.workers_in,
+            workers_out=self.workers_out,
         )
-
-        # All workers are implemented as a threaded queue. But the individual model requires a specific processing order:
-        # fetch -> download -> compress -> complete
-        # the watcher orchestrates the procedure and notifies upon completition
-        # the watcher will notify once it is sufficiently populated
     else: # manually trigger notification immediatly
         logger.info(f"Found existing article {article}. Now sending")
         self.article_complete_notifier(article)


-    def manual_processing(self, articles, workers):
-        for w in workers:
-            w.start()
-
-        for article in articles:
-            notifier = lambda article: logger.info(f"Completed manual actions for {article}")
-            ArticleWatcher(article, workers_manual = workers, notifier = notifier) # Article watcher wants a thread to link article to TODO: handle threads as a kwarg
-
-    def article_complete_notifier(self, article):
-        if self.worker_slack is None:
-            logger.warning("Skipping slack notification because worker is None")
-        else:
-            self.worker_slack.bot_worker.respond_channel_message(article)
-        if self.worker_mail is None:
-            logger.warning("Skipping mail notification because worker is None")
-        else:
-            self.worker_mail.send(article)
-
-        article.sent = True
-        article.save()
+    # def manual_processing(self, articles, workers):
+    #     for w in workers:
+    #         w.start()
+
+    #     for article in articles:
+    #         notifier = lambda article: logger.info(f"Completed manual actions for {article}")
+    #         ArticleWatcher(article, workers_manual = workers, notifier = notifier) # Article watcher wants a thread to link article to TODO: handle threads as a kwarg


 if __name__ == "__main__":
-    coordinator = Coordinator()
+    dispatcher = Dispatcher()

     if "upload" in sys.argv:
+        class PrintWorker:
+            def send(self, article):
+                print(f"Uploaded article {article}")
+
         articles = models.ArticleDownload.select().where(models.ArticleDownload.archive_url == "" or models.ArticleDownload.archive_url == "TODO:UPLOAD").execute()
         logger.info(f"Launching upload to archive for {len(articles)} articles.")
-        coordinator.manual_processing(articles, [UploadWorker()])
+        dispatcher.workers_in = [{"UploadWorker": UploadWorker()}]
+        dispatcher.workers_out = [{"PrintWorker": PrintWorker()}]
+        dispatcher.start()

     else: # launch with full action
-        slack_runner = slack_runner.BotRunner(coordinator.incoming_request)
-        kwargs = {
-            "worker_download" : DownloadWorker(),
-            "worker_fetch" : FetchWorker(),
-            "worker_upload" : UploadWorker(),
-            "worker_compress" : CompressWorker(),
-            "worker_slack" : slack_runner,
-            "worker_mail" : mail_runner,
-        }
         try:
-            coordinator.add_workers(**kwargs)
-            coordinator.start()
+            slack_runner = SlackRunner.BotRunner(dispatcher.incoming_request)
+            # All workers are implemented as a threaded queue. But the individual model requires a specific processing order:
+            # fetch -> download -> compress -> complete
+            # This is reflected in the following list of workers:
+            workers_in = [
+                OrderedDict({"FetchWorker": FetchWorker(), "DownloadWorker": DownloadWorker(), "CompressWorker": CompressWorker(), "NotifyRunner": "out"}),
+                OrderedDict({"UploadWorker": UploadWorker()})
+            ]
+            # The two dicts are processed independently. First element of first dict is called at the same time as the first element of the second dict
+            # Inside a dict, the order of the keys gives the order of execution (only when the first element is done, the second is called, etc...)
+
+            workers_out = [{"SlackRunner": slack_runner},{"MailRunner": MailRunner}]
+
+            dispatcher.workers_in = workers_in
+            dispatcher.workers_out = workers_out
+
+            dispatcher.start() # starts the thread, (ie. runs launch())
             slack_runner.start() # last one to start, inside the main thread
         except KeyboardInterrupt:
-            logger.info("Keyboard interrupt. Stopping Slack and Coordinator")
+            logger.info("Keyboard interrupt. Stopping Slack and dispatcher")
             slack_runner.stop()
-            logger.info("BYE!")
-            # coordinator was set as a daemon thread, so it will be stopped automatically
+            dispatcher.join()
+            for w_dict in workers_in:
+                for w in w_dict.values():
+                    if isinstance(w, Thread):
+                        w.stop()
+
+            # All threads are launched as a daemon thread, meaning that any 'leftover' should exit along with the sys call
             sys.exit(0)
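To make the refactored flow above easier to follow, here is a small, self-contained sketch of the chaining logic that `ArticleWatcher.get_next_worker` and `update` implement. It is a simplified stand-in, not the actual module: the dummy workers run synchronously (the real ones are threaded queues keyed by their class names), `MiniWatcher` and the worker names are invented for illustration, and the real watcher additionally persists `article.sent` through the out-workers.

```python
from collections import OrderedDict

class MiniWatcher:
    """Simplified stand-in for ArticleWatcher: walks each worker chain in order."""
    def __init__(self, article, workers_in, workers_out):
        self.article = article
        self.workers_in = workers_in
        self.workers_out = workers_out
        for w_dict in self.workers_in:               # each dict is an independent chain
            self.get_next_worker(w_dict).process(self)

    def get_next_worker(self, worker_dict, worker_name=""):
        keys = list(worker_dict.keys())
        if worker_name == "":                        # first worker of the chain
            return worker_dict[keys[0]]
        try:
            return worker_dict[keys[keys.index(worker_name) + 1]]
        except IndexError:
            return None

    def update(self, worker_name):
        for w_dict in self.workers_in:
            if worker_name in w_dict:
                next_worker = self.get_next_worker(w_dict, worker_name)
                if next_worker == "out":             # sentinel: hand over to the out-workers
                    for out_dict in self.workers_out:
                        self.get_next_worker(out_dict).send(self.article)
                elif next_worker is not None:
                    next_worker.process(self)

class DummyWorker:
    """Synchronous fake worker; the real ones are threaded queues."""
    def __init__(self, name):
        self.name = name
    def process(self, watcher):
        print(f"{self.name} processed {watcher.article}")
        watcher.update(self.name)                    # same callback the real workers use
    def send(self, article):
        print(f"notified about {article}")

workers_in = [
    OrderedDict({"Fetch": DummyWorker("Fetch"),
                 "Download": DummyWorker("Download"),
                 "Compress": DummyWorker("Compress"),
                 "Notify": "out"}),
    OrderedDict({"Upload": DummyWorker("Upload")}),
]
workers_out = [{"Printer": DummyWorker("Printer")}]

MiniWatcher("article-42", workers_in, workers_out)
# Fetch, Download and Compress run in sequence, Upload runs as its own chain,
# and the "out" sentinel triggers the notification at the end.
```

Running this prints the fetch → download → compress chain, the independent upload chain, and the final notification, which mirrors the `workers_in`/`workers_out` wiring in `__main__` above.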
@@ -157,7 +157,12 @@ class BotApp(App):
         if say is None:
             say = self.say_substitute
         answers = article.slack_info
-        for a in answers:
+        if article.slack_ts == 0:
+            self.logger.error(f"{article} has no slack_ts")
+        else:
+            self.logger.info("Skipping slack reply because it is broken")
+        for a in []:
+        # for a in answers:
             if a["file_path"]:
                 try:
                     self.client.files_upload(

@@ -166,20 +171,20 @@ class BotApp(App):
                         file = a["file_path"],
                         thread_ts = article.slack_ts_full
                     )
-                    status = True
+                    # status = True
                 except SlackApiError as e: # upload resulted in an error
                     say(
                         "File {} could not be uploaded.".format(a),
                         thread_ts = article.slack_ts_full
                     )
-                    status = False
+                    # status = False
                     self.logger.error(f"File upload failed: {e}")
             else: # anticipated that there is no file!
                 say(
                     f"{a['reply_text']}",
                     thread_ts = article.slack_ts_full
                 )
-                status = True
+                # status = True

     def startup_status(self):

@@ -230,6 +235,9 @@ class BotRunner():
         self.logger.info("Closed Slack-Socketmodehandler")

+    def send(self, article):
+        """Proxy function to send a message to the slack channel, Called by ArticleWatcher once the Article is ready"""
+        self.bot_worker.respond_channel_message(article)
@@ -7,12 +7,10 @@ class TemplateWorker(Thread):
     """Parent class for any subsequent worker of the article-download pipeline. They should all run in parallel, thus the Thread subclassing"""
     logger = logging.getLogger(__name__)

-    def __init__(self, *args, **kwargs) -> None:
+    def __init__(self, **kwargs) -> None:
         target = self._queue_processor # will be executed on Worker.start()
-        group = kwargs.get("group", None)
-        name = kwargs.get("name", None)
-
-        super().__init__(group=group, target=target, name=name)
+        self.keep_running = True
+        super().__init__(target=target, daemon=True)
         self._article_queue = []
         self.logger.info(f"Worker thread {self.__class__.__name__} initialized successfully")

@@ -23,7 +21,7 @@ class TemplateWorker(Thread):

     def _queue_processor(self):
         """This method is launched by thread.run() and idles when self._article_queue is empty. When an external caller appends to the queue it jumps into action"""
-        while True: # PLEASE tell me if I'm missing an obvious better way of doing this!
+        while self.keep_running: # PLEASE tell me if I'm missing an obvious better way of doing this!
             if len(self._article_queue) == 0:
                 time.sleep(5)
             else:

@@ -39,3 +37,10 @@ class TemplateWorker(Thread):
         article = article_watcher.article
         article = action(article) # action updates the article object but does not save the change
         article.save()
+        article_watcher.update(self.__class__.__name__)
+
+
+    def stop(self):
+        self.logger.info(f"Stopping worker {self.__class__.__name__} whith {len(self._article_queue)} articles left in queue")
+        self.keep_running = False
+        self.join()
@@ -25,7 +25,7 @@ class DownloadWorker(TemplateWorker):
         action = self.dl_runner

         super()._handle_article(article_watcher, action)
-        article_watcher.download_completed = True
+        # article_watcher.download_completed = True

@@ -36,7 +36,7 @@ class FetchWorker(TemplateWorker):
     def _handle_article(self, article_watcher):
         action = get_description # function
         super()._handle_article(article_watcher, action)
-        article_watcher.fetch_completed = True
+        # article_watcher.fetch_completed = True

@@ -52,7 +52,7 @@ class UploadWorker(TemplateWorker):
             return run_upload(*args, **kwargs)

         super()._handle_article(article_watcher, action)
-        article_watcher.upload_completed = True
+        # article_watcher.upload_completed = True

@@ -63,4 +63,4 @@ class CompressWorker(TemplateWorker):
     def _handle_article(self, article_watcher):
         action = shrink_pdf
         super()._handle_article(article_watcher, action)
-        article_watcher.compression_completed = True
+        # article_watcher.compression_completed = True