Compare commits: 12a7de91ed...40498ac8f0 (5 commits)

- 40498ac8f0
- 9ca4985853
- bc5eaba519
- 8e46f30f07
- 79e3f54955
Dockerfile (39 lines deleted)
@@ -1,39 +0,0 @@
FROM python:latest

ENV TZ Euopre/Zurich

# RUN echo "deb http://deb.debian.org/debian/ unstable main contrib non-free" >> /etc/apt/sources.list
# allows the installation of the latest firefox-release (debian is not usually a rolling release)
RUN apt-get update && apt-get install -y \
    evince \
    # for checking
    xauth \
    # for gui
    # wget tar firefox \
    # for geckodriver
    ghostscript
    # for compression


# Download gecko (firefox) driver for selenium
# RUN wget https://github.com/mozilla/geckodriver/releases/download/v0.31.0/geckodriver-v0.31.0-linux64.tar.gz
# RUN tar -x geckodriver -zf geckodriver-v0.31.0-linux64.tar.gz -O > /usr/bin/geckodriver
# RUN chmod +x /usr/bin/geckodriver
# RUN rm geckodriver-v0.31.0-linux64.tar.gz


RUN useradd --create-home --shell /bin/bash --uid 1001 autonews
# id mapped to local user
# home directory needed for pip package installation
RUN mkdir -p /app/auto_news
RUN chown -R autonews:autonews /app
USER autonews
RUN export PATH=/home/autonews/.local/bin:$PATH

COPY requirements.txt /app/requirements.txt
RUN python3 -m pip install -r /app/requirements.txt

COPY app /app/auto_news
WORKDIR /app/auto_news

ENTRYPOINT ["python3", "runner.py"]
README.md (119 changed lines)
@@ -1,88 +1,85 @@
# Auto_news
# COSS_ARCHIVING

A utility to fetch article requests from slack and generate pdfs for them, fully automatically.
A utility to

* fetch article requests from slack
* generate pdfs for them
* compress them
* send them via slack + email
* upload them to the COSS NAS

## Running - Pure docker
> I recommend running with docker compose instead
### How to run - auto archiving mode
In this mode the program is launched as a docker container, in headless mode. For persistence purposes a local storage volume is required, but that's it!

`docker run -it -v <your storage>:/app/file_storage/ auto_news`

You can specify additional parameters:

`docker run -it -v <your storage>:/app/file_storage/ auto_news debug` runs with debug values (does not write to the prod db, does not send mails)

`docker run -it -v <your storage>:/app/file_storage/ auto_news upload` catches up on incomplete uploads to archive.

`docker run -it -v <your storage>:/app/file_storage/ auto_news reducedfetch` makes assumptions about the status of the slack chat and greatly reduces the number of api calls (faster start-up).

These parameters can be combined (mostly for testing, I guess).

Finally, for manual file verification:

`docker run -it -v <your storage>:/app/file_storage/ -e DISPLAY=":0" --network host -v $XAUTHORITY:/root/.Xauthority auto_news check` lets you visually verify the downloaded files. The additional parameters are required in order to open GUIs on the host.
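Editor's note: since these mode flags steer the same behavior that the compose setup later drives via environment variables (`DEBUG`, `UPLOAD`, `CHECK`, `REDUCEDFETCH`), here is a minimal sketch of how such positional flags could be translated at startup. This is illustrative only; the actual argument handling in `runner.py` is not part of this diff.

```python
# Hypothetical sketch: map CLI mode flags onto the env vars the workers read.
import os
import sys

KNOWN_MODES = {"debug", "upload", "check", "reducedfetch"}

def apply_mode_flags(argv):
    """Translate positional flags (e.g. 'debug upload') into DEBUG=true etc."""
    for flag in argv[1:]:
        if flag not in KNOWN_MODES:
            raise SystemExit(f"Unknown mode: {flag}")
        os.environ[flag.upper()] = "true"

if __name__ == "__main__":
    apply_mode_flags(sys.argv)
```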
### How to run - development mode
In this mode, a docker container is launched with an additional volume: the local code. You can test your code without the need to rebuild the image.

`docker run -it -v <your storage>:/app/file_storage/ -v <your code>:/code/ --entrypoint /bin/bash auto_news`
You are dropped into a bash shell, in which you can navigate to the `/code` directory and then test live.

### Cheat-sheet Remy:

`docker run -it -v /mnt/Data/COSS/Downloads/auto_news.container/:/app/file_storage/ auto_news`

`docker run -it -v /mnt/Data/COSS/Downloads/auto_news.container/:/app/file_storage/ -v /mnt/Data/COSS/Development/auto_news/app:/code --entrypoint /bin/bash auto_news`


`docker run -it -v /mnt/Data/COSS/Downloads/auto_news.container/:/app/file_storage/ -e DISPLAY=":0" --network host -v $XAUTHORITY:/root/.Xauthority auto_news check`

... fully automatically. Run it now, thank me later.

---
## Running - Docker compose

I also wrote a rudimentary docker compose file which makes running much simpler. Just run
The included `docker-compose` file is now necessary for easy orchestration of the various services.

`docker compose --env-file <desired mode> up`
All relevant passthroughs and mounts are specified through the env-file, for which I configured 4 versions:

All relevant passthroughs and mounts are specified through the env-file, for which I configured 4 versions: production, debug (development in general), upload and check. These files will have to be adapted to your individual setup but can be reused more easily.
* production
* debug (development in general)
* upload
* check

For the debug env-file, you will likely want interactivity, so you need to run:
These files will have to be adapted to your individual setup but won't change significantly once set up.

`docker compose --env-file env/debug up -d && docker compose --env-file env/debug exec auto_news bash && docker compose --env-file env/debug down`
### Overview of the modes

The production mode performs all automatic actions and therefore does not require any manual intervention. It queries the slack workspace, adds the new requests to the database, downloads all files and metadata, uploads the urls to archive.org and sends out the downloaded article. As a last step the newly created file is synced to the COSS-NAS.

The debug mode is more sophisticated and allows for big code changes without the need to recompile. It directly mounts the code-directory into the container. As a failsafe the environment variable `DEBUG=true` is set. The whole utility is then run in a sandbox environment (slack-channel, database, email) so that Dirk is not affected by any mishaps.

The check mode is less sophisticated but shows the downloaded articles on the host for visual verification. This requires passthroughs for X11.

Upload mode is much simpler: it goes over the existing database and operates on the articles where the upload to archive.org has not yet occurred (archive.org is slow and the other operations usually finish before the queue is consumed). It retries their upload; a sketch follows below.
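Editor's note: conceptually, what upload mode re-runs is just the `upload_to_archive` routine shown further down in this diff, applied to every article that never received an `archive_url`. A sketch using the peewee models from this repo; the import path and the exact filter are assumptions, not code from this diff:

```python
# Sketch only: retry archive.org uploads for articles without a saved archive_url.
# Model and field names follow this diff; import path and filter are assumed.
from utils_storage import models
from utils_worker.upload import run_upload  # hypothetical import path

def retry_pending_uploads():
    pending = models.ArticleDownload.select().where(
        (models.ArticleDownload.archive_url == "") |
        (models.ArticleDownload.archive_url.startswith("Error"))  # "Error while uploading: ..."
    )
    for article in pending:
        run_upload(article)  # sets article.archive_url, or an error string on failure
        article.save()
```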
* For normal `production` mode run:

`docker compose --env-file env/production run news_fetch`


<!-- > Note:
>
> The `debug` mode requires additional input. Once `docker compose up` is running, in a new session run `docker compose --env-file env/debug exec bash`. The live-mounted code is then under `/code`. Note that the `DEBUG=true` environment variable is still set. If you want to test things on production, run `export DEBUG=false`.
-->
* For `debug` mode run:

`docker compose --env-file env/debug run news_fetch`

which drops you into an interactive shell (`ctrl+d` to exit the container shell).

> Note:
> The live-mounted code is now under `/code`. Note that the `DEBUG=true` environment variable is still set. If you want to test things on production, run `export DEBUG=false`. Running `python runner.py` will then run the newly written code, but with the production database and storage.

* For `check` mode, some env-variables are also changed and you still require interactivity. You don't need the geckodriver service, however. The simplest way is to run

`docker compose --env-file env/check run --no-deps --rm news_fetch`

* Finally, for `upload` mode no interactivity and no additional services are required. Simply run:

`docker compose --env-file env/upload run --no-deps --rm news_fetch`

### Stopping
Run

`docker compose --env-file env/production down`

which terminates all containers associated with the `docker-compose.yaml`.

## Building

> The software (firefox, selenium, python) changes frequently. For non-breaking changes it is useful to regularly clean-build the docker image! This is also crucial to update the code itself.

In docker, simply run:

`docker build -t auto_news --no-cache .`

where the `Dockerfile` has to be in the working directory.

In docker compose, run the usual command, but append

`docker compose ... up --build`

In docker compose, run

`docker compose --env-file env/production build`


## Roadmap:

[ ] automatically upload files to NAS
[_] handle paywalled sites like faz, spiegel, ... through their dedicated sites (see nexisuni.com for instance), available through the ETH network

[ ] handle paywalled sites like faz, spiegel, ... through their dedicated edu-friendly sites
...

## Manual Sync to NAS:
I use `rsync`. Mounting the NAS locally, I navigate to the location of the local folder (notice the trailing slash). Then run

`rsync -Razq --no-perms --no-owner --no-group --temp-dir=/tmp --progress --log-file=rsync.log <local folder>/ "<remote>"`

where `<remote>` is the location where the NAS is mounted. (Options: `R` - relative paths, `a` - archive mode (multiple actions), `z` - compress during transfer, `q` - quiet. We also don't copy most of the metadata and we keep a log of the transfers.)
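Editor's note: if this manual step is run often, it can also be scripted; a minimal Python sketch wrapping the exact command above (the helper name and example paths are placeholders):

```python
# Minimal sketch wrapping the rsync invocation above; paths are placeholders.
import subprocess

def sync_to_nas(local_folder, remote_mount):
    subprocess.run(
        [
            "rsync", "-Razq",
            "--no-perms", "--no-owner", "--no-group",
            "--temp-dir=/tmp", "--progress", "--log-file=rsync.log",
            local_folder.rstrip("/") + "/",  # note the trailing slash
            remote_mount,
        ],
        check=True,  # raise on a non-zero rsync exit code
    )

# sync_to_nas("/path/to/coss_archiving/files", "/mnt/nas/archiving")
```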
docker-compose.yaml

@@ -1,30 +1,12 @@
# docker compose --env-file env/debug up
# Usage:
# docker compose --env-file env/<mode> run <args> news_fetch && docker-compose --env-file env/production down

version: "3.9"

services:
  auto_news:
    build: .
    image: auto_news:latest
    volumes:
      - ${CONTAINER_DATA}:/app/file_storage
      - ${CODE:-/dev/null}:/code # not set in prod, defaults to /dev/null
      - ${XSOCK-/dev/null}:${XSOCK-/tmp/sock}
      - ${XAUTHORITY-/dev/null}:/home/auto_news/.Xauthority
    environment:
      - DISPLAY=$DISPLAY
      - TERM=xterm-256color # colored logs
      - COLUMNS=160 # for wider logs
      - DEBUG=${DEBUG}
      - CHECK=${CHECK}
      - UPLOAD=${UPLOAD}
      - HEADLESS=${HEADLESS}
      - REDUCEDFETCH=${REDUCEDFETCH}
    entrypoint: ${ENTRYPOINT:-python3 runner.py} # by default launch workers as defined in the Dockerfile
    stdin_open: ${INTERACTIVE:-false} # docker run -i
    tty: ${INTERACTIVE:-false} # docker run -t

  geckodriver:
    image: selenium/standalone-firefox:101.0
    image: selenium/standalone-firefox:103.0
    volumes:
      - ${XSOCK-/dev/null}:${XSOCK-/tmp/sock}
      - ${XAUTHORITY-/dev/null}:/home/auto_news/.Xauthority
@@ -35,3 +17,59 @@ services:
    user: 1001:1001
    expose: # exposed to other docker-compose services only
      - "4444"


  vpn:
    image: wazum/openconnect-proxy:latest
    env_file:
      - ${CONTAINER_DATA}/config/vpn.config
    cap_add:
      - NET_ADMIN
    volumes:
      - /dev/net/tun:/dev/net/tun
    # alternative to cap_add & volumes: specify privileged: true


  nas_sync:
    depends_on:
      - vpn # used to establish a connection to the SMB server
    network_mode: "service:vpn"
    build: nas_sync
    image: nas_sync:latest
    cap_add: # capabilities needed for mounting the SMB share
      - SYS_ADMIN
      - DAC_READ_SEARCH
    volumes:
      - ${CONTAINER_DATA}/files:/sync/local_files
      - ${CONTAINER_DATA}/config/nas_sync.config:/sync/nas_sync.config
      - ${CONTAINER_DATA}/config/nas_login.config:/sync/nas_login.config
    command:
      - nas22.ethz.ch/gess_coss_1/helbing_support/Files RM/Archiving/TEST # first command is the target mount path
      - lsyncd
      - /sync/nas_sync.config


  news_fetch:
    build: news_fetch
    image: news_fetch:latest

    depends_on: # when using docker compose run news_fetch, the dependencies are started as well
      - nas_sync
      - geckodriver

    volumes:
      - ${CONTAINER_DATA}:/app/containerdata # always set
      - ${CODE:-/dev/null}:/code # not set in prod, defaults to /dev/null
      - ${XSOCK-/dev/null}:${XSOCK-/tmp/sock} # x11 socket, needed for gui
      # - ${XAUTHORITY-/dev/null}:/home/auto_news/.Xauthority # xauth needed for authenticating to x11
    environment:
      - DISPLAY=$DISPLAY # needed to let x11 apps know where to connect to

      - DEBUG=${DEBUG}
      - CHECK=${CHECK}
      - UPLOAD=${UPLOAD}
      - HEADLESS=${HEADLESS}
      - REDUCEDFETCH=${REDUCEDFETCH}
    entrypoint: ${ENTRYPOINT:-python3 runner.py} # by default launch workers as defined in the Dockerfile
    stdin_open: ${INTERACTIVE:-false} # docker run -i
    tty: ${INTERACTIVE:-false} # docker run -t
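Editor's note: with geckodriver split out into its own service, `news_fetch` presumably talks to it over the network instead of spawning a local Firefox (the commented-out `webdriver.Firefox(...)` in `PDFDownloader` further down points the same way). A sketch of such a connection, with the service name and port taken from this compose file; the exact call used in the repo is not shown in this diff:

```python
# Sketch: connect to the standalone-firefox container from news_fetch.
# "geckodriver" is the compose service name; 4444 is the port it exposes.
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("--headless")  # mirrors HEADLESS=true from the env files

driver = webdriver.Remote(
    command_executor="http://geckodriver:4444",  # resolvable on the compose network
    options=options,
)
driver.get("https://example.com")
driver.quit()
```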
env/check (6 changed lines)
@@ -1,7 +1,6 @@
# Does not run any downloads but displays the previously downloaded but not yet checked files. Requires display-access via xauth

CONTAINER_DATA=~/Bulk/COSS/Downloads/auto_news.container
HOSTS_FILE=~/Bulk/COSS/Downloads/auto_news.container/dependencies/hosts
CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving

XAUTHORITY=$XAUTHORITY
XSOCK=/tmp/.X11-unix
@@ -11,3 +10,6 @@ CHECK=true
HEADLESS=true
UPLOAD=false
REDUCEDFETCH=false

# ENTRYPOINT="/bin/bash"
INTERACTIVE=true
env/debug (3 changed lines)
@@ -1,7 +1,6 @@
# Runs in a debugging mode; does not launch anything at all but starts a bash process

CONTAINER_DATA=~/Bulk/COSS/Downloads/auto_news.container
HOSTS_FILE=~/Bulk/COSS/Downloads/auto_news.container/dependencies/hosts
CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving

CODE=./
XAUTHORITY=$XAUTHORITY
env/production (4 changed lines)
@@ -1,8 +1,8 @@
# Runs on the main slack channel with the full worker setup. If nothing funky has occurred, reducedfetch is a speedup

CONTAINER_DATA=~/Bulk/COSS/Downloads/auto_news.container
HOSTS_FILE=~/Bulk/COSS/Downloads/auto_news.container/dependencies/hosts
CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving

CONTAINERS_TO_RUN=nas_sync, geckodriver
DEBUG=false
CHECK=false
UPLOAD=false
env/upload (5 changed lines)
@@ -1,9 +1,8 @@
# Does not run any other workers and only uploads to archive the urls that weren't previously uploaded

CONTAINER_DATA=~/Bulk/COSS/Downloads/auto_news.container
HOSTS_FILE=~/Bulk/COSS/Downloads/auto_news.container/dependencies/hosts

CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving

NEWS_FETCH_DEPENDS_ON="[]"
DEBUG=false
CHECK=false
UPLOAD=true
@@ -10,7 +10,7 @@ from persistence import message_models


# Constant values...
MESSAGES_DB = "/app/file_storage/messages.db"
MESSAGES_DB = "/app/containerdata/messages.db"

BOT_ID = "U02MR1R8UJH"
ARCHIVE_ID = "C02MM7YG1V4"
nas_sync/Dockerfile (new file, 9 lines)
@@ -0,0 +1,9 @@
FROM bash:latest
# alpine with bash instead of sh
ENV TZ=Europe/Berlin
RUN apk add lsyncd cifs-utils rsync
RUN mkdir -p /sync/remote_files
COPY entrypoint.sh /sync/entrypoint.sh


ENTRYPOINT ["bash", "/sync/entrypoint.sh"]
nas_sync/entrypoint.sh (new file, 10 lines)
@@ -0,0 +1,10 @@
#!/bin/bash
set -e

sleep 5 # waits for the vpn to have an established connection
echo "Starting NAS sync"
mount -t cifs "//$1" -o credentials=/sync/nas_login.config /sync/remote_files
echo "Successfully mounted SAMBA remote: $1 --> /sync/remote_files"
shift # consumes the variable set in $1 so that $@ only contains the remaining arguments

exec "$@"
news_fetch/Dockerfile (new file, 27 lines)
@@ -0,0 +1,27 @@
FROM python:latest

ENV TZ Europe/Zurich


RUN apt-get update && apt-get install -y \
    evince \
    # for checking
    xauth \
    # for gui
    ghostscript
    # for compression


RUN useradd --create-home --shell /bin/bash --uid 1001 autonews
# id mapped to local user
# home directory needed for pip package installation
RUN mkdir -p /app/auto_news
RUN chown -R autonews:autonews /app
USER autonews
RUN export PATH=/home/autonews/.local/bin:$PATH

COPY requirements.txt /app/requirements.txt
RUN python3 -m pip install -r /app/requirements.txt

COPY app /app/auto_news
WORKDIR /app/auto_news
@@ -1,7 +1,9 @@
from ast import parse
from dataclasses import dataclass
import os
import shutil
import configparser
import logging
from datetime import datetime
from peewee import SqliteDatabase
from rich.logging import RichHandler

@@ -17,7 +19,7 @@ logger = logging.getLogger(__name__)

# load config file containing constants and secrets
parsed = configparser.ConfigParser()
parsed.read("/app/file_storage/config.ini")
parsed.read("/app/containerdata/config/news_fetch.config.ini")

if os.getenv("DEBUG", "false") == "true":
    logger.warning("Found 'DEBUG=true', setting up dummy databases")
@@ -28,8 +30,18 @@ if os.getenv("DEBUG", "false") == "true":
    parsed["DOWNLOADS"]["local_storage_path"] = parsed["DATABASE"]["db_path_dev"]
else:
    logger.warning("Found 'DEBUG=false' and running on production databases, I hope you know what you're doing...")

    db_base_path = parsed["DATABASE"]["db_path_prod"]
    logger.info("Backing up databases")
    backup_dst = parsed["DATABASE"]["db_backup"]
    today = datetime.today().strftime("%Y.%m.%d")
    shutil.copyfile(
        os.path.join(db_base_path, parsed["DATABASE"]["chat_db_name"]),
        os.path.join(backup_dst, today + "." + parsed["DATABASE"]["chat_db_name"]),
    )
    shutil.copyfile(
        os.path.join(db_base_path, parsed["DATABASE"]["download_db_name"]),
        os.path.join(backup_dst, today + "." + parsed["DATABASE"]["download_db_name"]),
    )


from utils_storage import models
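Editor's note: a quick worked example of the date-prefixed backup naming this introduces (the config values here are placeholders for whatever `news_fetch.config.ini` defines):

```python
# Illustration only: what the date-prefixed backup path looks like.
import os
from datetime import datetime

backup_dst = "/app/containerdata/backups"  # placeholder for parsed["DATABASE"]["db_backup"]
chat_db_name = "messages.db"               # placeholder for parsed["DATABASE"]["chat_db_name"]

today = datetime(2022, 7, 20).strftime("%Y.%m.%d")  # "2022.07.20"
print(os.path.join(backup_dst, today + "." + chat_db_name))
# -> /app/containerdata/backups/2022.07.20.messages.db
```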
@@ -4,6 +4,7 @@ models = configuration.models
from threading import Thread
import logging
import os
import sys
logger = logging.getLogger(__name__)

from utils_mail import runner as mail_runner
@@ -14,11 +15,11 @@ from utils_worker.workers import CompressWorker, DownloadWorker, FetchWorker, Up
class ArticleWatcher:
    """Wrapper for a newly created article object. Notifies the coordinator upon change/completion"""
    def __init__(self, article, thread, **kwargs) -> None:
        self.article_id = article.id # in case article becomes None at any point, we can still track the article
        self.article = article
        self.thread = thread

        self.completition_notifier = kwargs.get("notifier")

        self.fetch = kwargs.get("worker_fetch", None)
        self.download = kwargs.get("worker_download", None)
        self.compress = kwargs.get("worker_compress", None)
@@ -95,13 +96,14 @@ class ArticleWatcher:
        self._upload_completed = value
        self.update_status("upload")


    def __str__(self) -> str:
        return f"Article with id {self.article_id}"


class Coordinator(Thread):
    def __init__(self, **kwargs) -> None:
        """Launcher calls this Coordinator as the main thread to handle connections between the other workers (threaded)."""
        super().__init__(target = self.launch)
        super().__init__(target = self.launch, daemon=True)

    def add_workers(self, **kwargs):
        self.worker_slack = kwargs.pop("worker_slack", None)
@@ -154,7 +156,7 @@ class Coordinator(Thread):

        for article in articles:
            notifier = lambda article: print(f"Completed manual actions for {article}")
            ArticleWatcher(article, workers_manual = workers, notifier = notifier)
            ArticleWatcher(article, None, workers_manual = workers, notifier = notifier) # ArticleWatcher wants a thread to link the article to. TODO: handle threads as a kwarg

    def article_complete_notifier(self, article, thread):
        if self.worker_slack is None:
@@ -191,6 +193,13 @@ if __name__ == "__main__":
        "worker_slack" : slack_runner,
        "worker_mail" : mail_runner,
    }
    try:
        coordinator.add_workers(**kwargs)
        coordinator.start()
        slack_runner.start()
    except KeyboardInterrupt:
        logger.info("Keyboard interrupt. Stopping Slack and Coordinator")
        slack_runner.stop()
        print("BYE!")
        # coordinator was set as a daemon thread, so it will be stopped automatically
        sys.exit(0)
@@ -35,7 +35,7 @@ def file_overview(file_url: str, file_attributes: list, options: dict) -> None:
    file_table = Table(
        title = file_url,
        row_styles = ["white", "bright_black"],
        min_width = 150
        min_width = 100
    )

    file_table.add_column("Attribute", justify = "right", no_wrap = True)
@@ -55,7 +55,7 @@ def file_overview(file_url: str, file_attributes: list, options: dict) -> None:


def send_reaction_to_slack_thread(article, reaction):
    """Sends the verification status as a reaction to the associated slack thread. This will significantly decrease load times of the bot"""
    """Sends the verification status as a reaction to the associated slack thread."""
    thread = article.slack_thread
    messages = models.Message.select().where(models.Message.text.contains(article.article_url))
    # TODO rewrite this shit
@@ -63,9 +63,10 @@ def send_reaction_to_slack_thread(article, reaction):
        print("Found more than 5 messages. Aborting reactions...")
        return
    for m in messages:
        if not m.has_single_url:
        if m.is_processed_override:
            print("Message already processed. Aborting reactions...")
        elif not m.has_single_url:
            print("Found thread but won't send reaction because thread has multiple urls")
            pass
        else:
            ts = m.slack_ts
            bot_client.reactions_add(
@@ -158,11 +159,11 @@ def verify_unchecked():

    try:
        # close any previously opened windows:
        subprocess.call(["kill", "`pgrep evince`"])
        # subprocess.call(["kill", "`pgrep evince`"])
        os.system("pkill evince")
        # then open a new one
        subprocess.Popen(["evince", f"file://{os.path.join(article.save_path, article.file_name)}"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        # suppress evince gtk warnings
        print("done")
    except Exception as e:
        print(e)
        continue
@@ -37,6 +37,6 @@ def send(article_model):
        smtp.sendmail(config["sender"], config["recipient"], mail.as_string())
        smtp.quit()
        logger.info("Mail successfully sent.")
    except Exception as e:
    except smtplib.SMTPException as e:
        logger.error("Could not send mail for article {}".format(article_model))
        logger.info(e)
@@ -14,6 +14,7 @@ LATEST_RECORDED_REACTION = 0


def init(client) -> None:
    """Starts fetching past messages and returns the freshly launched thread"""
    global slack_client
    slack_client = client

@@ -26,7 +27,7 @@ def init(client) -> None:
    # fetch all the messages we could have possibly missed
    logger.info("Querying missed messages, threads and reactions. This can take some time.")
    fetch_missed_channel_messages() # not threaded
    t = Thread(target = fetch_missed_channel_reactions) # threaded, runs in background (usually takes a long time)
    t = Thread(target = fetch_missed_channel_reactions, daemon=True) # threaded, runs in background (usually takes a long time)
    t.start()

    if os.getenv("REDUCEDFETCH", "false") == "true":
@@ -153,16 +154,23 @@ def fetch_missed_channel_reactions():
    logger.info("Starting background fetch of channel reactions...")
    threads = [t for t in models.Thread.select() if not t.is_fully_processed]
    for i,t in enumerate(threads):
        reactions = []
        try:
            query = slack_client.reactions_get(
                channel = config["archive_id"],
                timestamp = t.slack_ts
            )
            reactions = query["message"].get("reactions", []) # default = []
            reactions = query.get("message", []).get("reactions", []) # default = []
        except SlackApiError: # probably a rate_limit:
        except SlackApiError as e:
            if e.response.get("error", "") == "message_not_found":
                m = t.initiator_message
                logger.warning(f"Message (id={m.id}) not found. Skipping and saving...")
                # this usually means the message is past the 1000 message limit imposed by slack. Mark it as processed in the db
                m.is_processed_override = True
                m.save()
            else: # probably a rate_limit:
                logger.error("Hit rate limit while querying reactions. Retrying in {}s ({}/{} queries elapsed)".format(config["api_wait_time"], i, len(threads)))
                time.sleep(int(config["api_wait_time"]))
                reactions = query["message"].get("reactions", [])

        for r in reactions:
            reaction_dict_to_model(r, t)
@@ -1,5 +1,6 @@
from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler
from slack_sdk.errors import SlackApiError

import logging
import configuration
@@ -18,7 +19,7 @@ class BotApp(App):
        super().__init__(*args, **kwargs)
        self.callback = callback

    def start(self):
    def pre_start(self):
        message_helpers.init(self.client)
        missed_messages, missed_reactions = message_helpers.get_unhandled_messages()

@@ -124,7 +125,7 @@ class BotApp(App):
        answers = article.slack_info
        for a in answers:
            if a["file_path"]:
                try: # either a["file_path"] does not exist, or the upload resulted in an error
                try: # upload resulted in an error
                    self.client.files_upload(
                        channels = config["archive_id"],
                        initial_comment = f"<@{config['responsible_id']}> \n {a['reply_text']}",
@@ -132,12 +133,13 @@ class BotApp(App):
                        thread_ts = thread.slack_ts
                    )
                    status = True
                except:
                except SlackApiError as e:
                    say(
                        "File {} could not be uploaded.".format(a),
                        thread_ts=thread.slack_ts
                    )
                    status = False
                    self.logger.error(f"File upload failed: {e}")
            else: # anticipated that there is no file!
                say(
                    f"<@{config['responsible_id']}> \n {a['reply_text']}",
@@ -171,14 +173,17 @@ class BotRunner():
        def handle_incoming_reaction(event, say):
            return self.bot_worker.handle_incoming_reaction(event)

        # target = self.launch
        # super().__init__(target=target)
        self.handler = SocketModeHandler(self.bot_worker, config["app_token"])


    def start(self):
        self.bot_worker.start()
        SocketModeHandler(self.bot_worker, config["app_token"]).start()
        self.bot_worker.pre_start()
        self.handler.start()


    def stop(self):
        self.handler.close()
        print("Bye handler!")

    # def respond_to_message(self, message):
    #     self.bot_worker.handle_incoming_message(message)
@@ -45,7 +45,11 @@ class ArticleDownload(DownloadBaseModel):
    # ... are added through foreignkeys

    def __str__(self) -> str:
        return f"ART [{self.title} -- {self.source_name}]"
        if self.title != '' and self.source_name != '':
            desc = f"{shorten_name(self.title)} -- {self.source_name}"
        else:
            desc = f"{self.article_url}"
        return f"ART [{desc}]"

    ## Useful Properties
    @property
@@ -255,7 +259,7 @@ class Message(ChatBaseModel):
    # reaction

    def __str__(self) -> str:
        return "MSG [{}]".format(self.text[:min(len(self.text), 30)].replace('\n','/') + '...')
        return "MSG [{}]".format(shorten_name(self.text).replace('\n','/'))

    @property
    def slack_ts(self):
@@ -320,3 +324,8 @@ def clear_path_name(path):
    converted = "".join([c if (c.isalnum() or c in keepcharacters) else "_" for c in path]).rstrip()
    return converted

def shorten_name(name, offset = 50):
    if len(name) > offset:
        return name[:offset] + "..."
    else:
        return name
@@ -31,7 +31,8 @@ class PDFDownloader:
            self.logger.warning("Opening browser GUI because of 'HEADLESS=false'")

        options.set_preference('print.save_as_pdf.links.enabled', True)
        # Just save if the filetype is pdf already, does not work!
        # Just save if the filetype is pdf already
        # TODO: this is not working right now

        options.set_preference("print.printer_Mozilla_Save_to_PDF.print_to_file", True)
        options.set_preference("browser.download.folderList", 2)
@@ -40,6 +41,7 @@ class PDFDownloader:
        options.set_preference("browser.download.dir", config["default_download_path"])

        self.logger.info("Starting gecko driver")
        # previously, in a single docker image:
        # self.driver = webdriver.Firefox(
        #     options = options,
        #     service = webdriver.firefox.service.Service(
@@ -153,11 +155,11 @@ class PDFDownloader:
            hrefs = [e.get_attribute("href") for e in self.driver.find_elements_by_xpath("//a[@href]")]
        except:
            hrefs = []
        len_old = len(hrefs)
        # len_old = len(hrefs)
        hrefs = [h for h in hrefs \
            if not sum([(domain in h) for domain in blacklisted]) # sum([True, False, False, False]) == 1 (esp. not 0)
            ] # filter a tiny bit at least
        self.logger.info(f"Hrefs filtered (before: {len_old}, after: {len(hrefs)})")
        # self.logger.info(f"Hrefs filtered (before: {len_old}, after: {len(hrefs)})")
        return hrefs
@@ -49,17 +49,3 @@ class YouTubeDownloader:
            article_object.file_name = ""

        return article_object




# class DummyArticle:
#     article_url = "https://www.welt.de/politik/ausland/article238267261/Baerbock-Lieferung-gepanzerter-Fahrzeuge-an-die-Ukraine-kein-Tabu.html"
#     save_path = "/app/file_storage/"
#     fname_template = "www.youtube.com -- Test"
#     file_name = ""

# m = DummyArticle()
# t = YouTubeDownloader()
# t.save_video(m)

# print(m.file_name)
@@ -12,7 +12,6 @@ def upload_to_archive(article_object):
        archive_url = wayback.save()
        # logger.info(f"{url} uploaded to archive successfully")
        article_object.archive_url = archive_url
        # time.sleep(4) # Archive uploads rate limited to 15/minute

    except Exception as e:
        article_object.archive_url = "Error while uploading: {}".format(e)
@@ -48,8 +48,8 @@ class UploadWorker(TemplateWorker):

    def _handle_article(self, article_watcher):
        def action(*args, **kwargs):
            run_upload(*args, **kwargs)
            time.sleep(5) # uploads to archive are throttled to 15/minute
            time.sleep(10) # uploads to archive are throttled to 15/minute, but 5s still triggers a blacklisting
            return run_upload(*args, **kwargs)

        super()._handle_article(article_watcher, action)
        article_watcher.upload_completed = True
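Editor's note: the arithmetic behind that change: 15 uploads per minute allows one request every 4 s, so 5 s looks safe on paper, yet archive.org evidently throttles more aggressively; 10 s halves the rate to 6 per minute. The same pattern as a standalone sketch (illustrative only, not code from this repo):

```python
# Illustrative throttle decorator: sleep before each call to respect a rate limit.
import time
from functools import wraps

def throttle(seconds):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            time.sleep(seconds)  # crude but effective: at most one call per `seconds`
            return func(*args, **kwargs)
        return wrapper
    return decorator

@throttle(10)  # <= 6 calls/minute, well under the nominal 15/minute limit
def upload(url):
    print(f"uploading {url}")
```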