Compare commits

..

4 Commits

54 changed files with 1224 additions and 867 deletions

5
.gitignore vendored
View File

@@ -1,12 +1,10 @@
.dev/
.vscode/
*.pyc
*.log
__pycache__/
config/container.yaml
config/local.env
## svelte:
# Logs
@@ -25,6 +23,7 @@ dist-ssr
# Editor directories and files
.vscode/*
!.vscode/extensions.json
.idea
.DS_Store
*.suo

4
.vscode/settings.json vendored Normal file
View File

@@ -0,0 +1,4 @@
{
"python.linting.flake8Enabled": true,
"python.linting.enabled": false
}

View File

@@ -1,87 +0,0 @@
include config/local.env
export
build:
@echo "Building..."
docker compose build $(flags)
down:
@echo "Stopping containers..."
docker compose down -t 0 --volumes
# Variables specific to debug
debug: export DEBUG=true
debug: export HEADFULL=true
debug: export ENTRYPOINT=/bin/bash
debug: export CODE=./
debug:
@echo "Running in debug mode..."
docker compose up -d geckodriver
docker compose run -it --service-ports $(target) $(flags) || true
make down
production: export DEBUG=false
production:
@echo "Running in production mode..."
docker compose run -it --service-ports $(target) $(flags) || true
make down
nas_sync:
@echo "Syncing NAS..."
SYNC_FOLDER=$(folder) docker compose run -it nas_sync $(flags) || true
docker compose down
docker container prune -f
make down
## Misc:
edit_profile: export CODE=./
edit_profile: export HEADFULL=true
edit_profile:
@echo "Editing profile..."
docker compose up -d geckodriver
sleep 5
docker compose exec geckodriver /bin/bash /code/geckodriver/edit_profile.sh || true
# runs inside the container
make down
db_interface:
docker create \
--name pgadmin \
-p 8080:80 \
-e 'PGADMIN_DEFAULT_EMAIL=${UNAME}@test.com' \
-e 'PGADMIN_DEFAULT_PASSWORD=password' \
-e 'PGADMIN_CONFIG_ENHANCED_COOKIE_PROTECTION=True' \
-e 'PGADMIN_CONFIG_LOGIN_BANNER="Authorised users only!"' \
dpage/pgadmin4
docker start pgadmin
sleep 5
# TODO auto add the server to the list displayed in the browser
# docker exec pgadmin sh -c "echo ${SERVER_DATA} > /tmp/servers.json"
# docker exec pgadmin sh -c "/venv/bin/python setup.py --load-servers /tmp/servers.json --user remy@test.com"
@echo "Go to http://localhost:8080 to access the database interface"
@echo "Username: ${UNAME}@test.com"
@echo "Password: password"
@echo "Hit any key to stop (not ctrl+c)"
read STOP
docker stop pgadmin
docker rm pgadmin
logs:
docker compose logs -f $(target) $(flags)
make down

119
README.md
View File

@@ -12,103 +12,45 @@ A utility to
---
## Running - through makefile
## Running - through launch file
> Prerequisite: make `launch` executable:
>
> `chmod +x launch`
Execute the file by running `make`. This won't do anything in itself. For the main usage you need to specify a mode and a target.
Execute the file by running `./launch`. This won't do anything in itself. You need to specify a mode, and then a command.
`make <mode> target=<target>`
`./launch <mode> <command> <command options>`
### Overview of the modes
The production mode performs all automatic actions and therefore does not require any manual intervention. It queries the slack workspace, adds the new requests to the database, downloads all files and metadata, uploads the urls to archive.org and sends out the downloaded article. As a last step the newly created file is synced to the COSS-NAS.
The production mode performs all automatic actions and therefore does not require any manual intervention. It queries the slack workspace, adds the new requests to the database, downloads all files and metadata, uploads the urls to archive.org and sends out the downloaded article. As a last step the newly created file is synced to the COSS-NAS.
The debug mode is more sophisticated and allows for big code changes without the need to rebuild the container. It directly mounts the code-directory into the container. As a failsafe the environment-variable `DEBUG=true` is set. The whole utility is then run in a sandbox environment (slack-channel, database, email) so that Dirk is not affected by any mishaps.
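A typical debug session might look like this (a sketch; the default entrypoint `python runner.py` is taken from `docker-compose.yaml`):
```bash
./launch debug news_fetch   # mounts ./ into the container and drops you into a bash shell
# (now inside the container)
python runner.py            # start the pipeline manually against the sandbox environment
```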
Two additional 'modes' are `build` and `down`. Build rebuilds the container, which is necessary after code changes. Down ensures a clean shutdown of *all* containers. Usually the launch-script handles this already but it sometimes fails, in which case `down` needs to be called again.
### Overview of the targets
### Overview of the commands
In essence a target is simply a service from docker-compose, which is run in an interactive environment. As such all services defined in `docker-compose.yaml` can be called as a target. Only two of them will be of real use:
In essence a command is simply a service from docker-compose, which is run in an interactive environment. As such all services defined in `docker-compose.yaml` can be called as commands. Only two of them will be of real use:
`news_fetch` does the majority of the actions mentioned above. By default, that is without any options, it runs a metadata-fetch, download, and upload to archive.org. The upload is usually the slowest which is why articles that are processed but don't yet have an archive.org url tend to pile up. You can therefore specify the option `upload` which only starts the upload for the concerned articles, as a catch-up if you will.
`news_fetch` does the majority of the actions mentioned above. By default, that is without any options, it runs a metadata-fetch, download, compression, and upload to archive.org. The upload is usually the slowest which is why articles that are processed but don't yet have an archive.org url tend to pile up. You can therefore specify the option `upload` which only starts the upload for the concerned articles, as a catch-up if you will.
Example usage:
```bash
make production target=news_fetch # full mode
make production target=news_fetch flags=upload # upload mode (lighter resource usage)
make debug target=news_fetch # debug mode, which drops you inside a new shell
./launch production news_fetch # full mode
./launch production news_fetch upload # upload mode (lighter resource usage)
./launch debug news_fetch # debug mode, which drops you inside a new shell
make production target=news_check
./launch production news_check
```
`news_check` starts a webapp, accessible under [http://localhost:8080](http://localhost:8080) and allows you to easily check the downloaded articles.
### Synchronising changes with NAS
I recommend `rsync`.
From within the ETH-network you can launch
```
make nas_sync folder=<target>
```
This will launch a docker container running `rsync`, connected to both the COSS NAS-share and your local files. Specifying a folder restricts the files that are watched for changes.
Example: `make nas_sync folder=2022/September` will take significantly less time than `make nas_sync folder=2022` but only considers files written to the September folder.
> Please check the logs for any suspicious messages. `rsync`ing to smb shares is prone to errors.
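For example, to follow the sync container's output (using the `logs` target from the Makefile):
```bash
make logs target=nas_sync   # follows the container logs; watch for rsync or mount errors
```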
### Misc. usage:
```bash
make build # rebuilds all containers to reflect code changes
make down # shuts down all containers (usually not necessary since this occurs automatically)
make edit_profile # opens a firefox window under localhost:7900 to edit the profile used by news_fetch
make db_interface # opens a postgres-interface to view the remote database (localhost:8080)
```
## First run:
> The program relies on a functioning firefox profile!
For the first run ever, run
`make edit_profile`
This will generate a new firefox profile under `coss_archiving/dependencies/news_fetch.profile`.
You can then go to [http://localhost:7900](http://localhost:7900) in your browser. Check the profile (under firefox://profile-internals).
Now install two addons: Idontcareaboutcookies and bypass paywalls clean (from firefox://extensions). They ensure that most sites just work out of the box. You can additionally install adblockers such as ublock origin.
You can then use this profile to further tweak various sites. The state of the sites (namely their cookies) will be used by `news_fetch`.
> Whenever you need to make changes to the profile, for instance re-log in to websites, just rerun `make edit_profile`.
## Building
> The software **will** change. Because the images referenced in docker compose are usually the `latest` ones, it is sufficient to update the containers.
In docker compose, run
`docker compose --env-file env/production build`
Or simpler, just run
`make build` (should issues occur you can also run `make build flags=--no-cache`)
## Roadmap:
- [ ] handle paywalled sites like faz, spiegel, ... through their dedicated sites (see nexisuni.com for instance), available through the ETH network
- [ ] improve reliability of nas_sync. (+ logging)
- [ ] divide month folders into smaller ones
## Appendix: (Running - Docker compose)
> I strongly recommend sticking to the usage of `make`.
## (Running - Docker compose)
> I strongly recommend sticking to the usage of `./launch`.
Instead of using the launch file you can manually issue `docker compose` commands. Example: check for logs.
@@ -133,3 +75,32 @@ docker compose --env-file env/production up # starts all services and shows thei
docker compose --env-file env/production logs -f news_fetch # follows along with the logs of only one service
docker compose --env-file env/production down
```
## Building
> The software (firefox, selenium, python) changes frequently. For non-breaking changes it is useful to regularly rebuild the docker image! This is also crucial to update the code itself.
In docker compose, run
`docker compose --env-file env/production build`
Or simpler, just run
`./launch build`
## Roadmap:
- [ ] handle paywalled sites like faz, spiegel, ... through their dedicated sites (see nexisuni.com for instance), available through the ETH network
## Manual Sync to NAS:
Manual sync is sadly still necessary, as the lsyncd client sometimes gets overwhelmed by quick writes.
I use `rsync`. After mounting the NAS locally, navigate to the location of the local folder (note the trailing slash in the command below), then run
`rsync -Razq --no-perms --no-owner --no-group --temp-dir=/tmp --progress --log-file=rsync.log <local folder>/ "<remote>"`
where `<remote>` is the location where the NAS is mounted. (Options: `R` - relative paths, `a` - archive mode (a shorthand bundling several flags), `z` - compress data during transfer, `q` - quiet. We also don't copy most of the metadata and we keep a log of the transfers.)
You can also use your OS' native copy option and select *do not overwrite*. This should only copy the missing files, significantly speeding up the operation.
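If in doubt, do a dry run first: rsync's `-n`/`--dry-run` flag (combined with `-v` to print the affected files) only lists what would be transferred, without writing anything:
```bash
# dry run: nothing is copied, the files that would be transferred are just listed
rsync -Razvn --no-perms --no-owner --no-group <local folder>/ "<remote>"
```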

View File

@@ -1,37 +0,0 @@
mail:
smtp_server: smtp.ethz.ch
port: 587
sender: "****************"
recipient: "****************"
uname: "****************"
password: "************"
slack:
bot_id: U02MR1R8UJH
archive_id: C02MM7YG1V4
debug_id: C02NM2H9J5Q
api_wait_time: 90
auth_token: "****************"
app_token: "****************"
database:
debug_db: /app/containerdata/debug/downloads.db
db_printout: /app/containerdata/backups
production_db_name: coss_archiving
production_user_name: "ca_rw"
production_password: "****************"
## user_name: ca_ro
## password: "****************"
downloads:
local_storage_path: /app/containerdata/files
debug_storage_path: /app/containerdata/debug/
default_download_path: /app/containerdata/tmp
remote_storage_path: /helbing_support/Archiving-Pipeline
browser_profile_path: /app/containerdata/dependencies/news_fetch.profile
# please keep this exact name
browser_print_delay: 3

View File

@@ -1,18 +0,0 @@
CONTAINER_DATA=***********
UNAME=***********
U_ID=***********
DB_HOST=***********
OPENCONNECT_URL=***********
OPENCONNECT_USER=***********
OPENCONNECT_PASSWORD=***********
OPENCONNECT_OPTIONS=--authgroup student-net
NAS_HOST=***********
NAS_PATH=/gess_coss_1/helbing_support/Archiving-Pipeline
NAS_USERNAME=***********
NAS_PASSWORD=***********
# Special characters like # need to be escaped (write: \#)

View File

@@ -4,11 +4,8 @@ services:
vpn: # Creates a connection behind the ETH Firewall to access NAS and Postgres
image: wazum/openconnect-proxy:latest
environment:
- OPENCONNECT_URL=${OPENCONNECT_URL}
- OPENCONNECT_USER=${OPENCONNECT_USER}
- OPENCONNECT_PASSWORD=${OPENCONNECT_PASSWORD}
- OPENCONNECT_OPTIONS=${OPENCONNECT_OPTIONS}
env_file:
- ${CONTAINER_DATA}/config/vpn.config
cap_add:
- NET_ADMIN
volumes:
@@ -17,17 +14,31 @@ services:
expose: ["5432"] # exposed here because db_passhtrough uses this network. See below for more details
nas_sync: # Syncs locally downloaded files with the NAS-share on nas22.ethz.ch/...
depends_on:
- vpn
network_mode: "service:vpn" # used to establish a connection to the SMB server from inside ETH network
build: nas_sync # local folder to build
image: nas_sync:latest
cap_add: # capabilities needed for mounting the SMB share
- SYS_ADMIN
- DAC_READ_SEARCH
volumes:
- ${CONTAINER_DATA}/files:/sync/local_files
- ${CONTAINER_DATA}/config/nas_sync.config:/sync/nas_sync.config
- ${CONTAINER_DATA}/config/nas_login.config:/sync/nas_login.config
command:
- nas22.ethz.ch/gess_coss_1/helbing_support/Files RM/Archiving/TEST # first argument is the target mount path
- lsyncd
- /sync/nas_sync.config
geckodriver: # separate docker container for pdf-download. This hugely improves stability (and creates shorter build times for the containers)
image: selenium/standalone-firefox:latest
shm_size: 2gb
image: ${GECKODRIVER_IMG}
environment:
- START_VNC=${HEADFULL-false} # as opposed to headless, used when requiring supervision (eg. for websites that crash)
- START_XVFB=${HEADFULL-false}
- SE_VNC_NO_PASSWORD=1
volumes:
- ${CONTAINER_DATA}/dependencies:/firefox_profile/
- ${CODE:-/dev/null}:/code
user: ${U_ID}:${U_ID} # since the app writes files to the local filesystem, it must be run as the current user
expose: ["4444"] # exposed to other docker-compose services only
ports:
- 7900:7900 # port for webvnc
@@ -36,7 +47,7 @@ services:
db_passthrough: # Allows a container on the local network to connect to a service (here postgres) through the vpn
network_mode: "service:vpn"
image: alpine/socat:latest
command: ["tcp-listen:5432,reuseaddr,fork", "tcp-connect:${DB_HOST}:5432"]
command: ["tcp-listen:5432,reuseaddr,fork", "tcp-connect:id-hdb-psgr-cp48.ethz.ch:5432"]
# expose: ["5432"] We would want this passthrough to expose its ports to the other containers
# BUT since it uses the same network as the vpn-service, it can't expose ports on its own. 5432 is therefore exposed under service.vpn.expose
@@ -44,18 +55,18 @@ services:
news_fetch: # Orchestration of the automatic download. It generates pdfs (via the geckodriver container), fetches descriptions, triggers a snapshot (on archive.org) and writes to a db
build: news_fetch
image: news_fetch:latest
depends_on: # when using docker compose run news_fetch, the dependencies are started as well
- nas_sync
- geckodriver
- db_passthrough
volumes:
- ${CONTAINER_DATA}:/app/containerdata # always set
- ./config/container.yaml:/app/config.yaml
- ${CODE:-/dev/null}:/code # not set in prod, defaults to /dev/null
environment:
- CONFIG_FILE=/app/config.yaml
- DEBUG=${DEBUG}
- UNAME=${UNAME}
user: ${U_ID}:${U_ID} # since the app writes files to the local filesystem, it must be run as the current user
entrypoint: ${ENTRYPOINT:-python runner.py} # by default launch workers as defined in the Dockerfile
# stdin_open: ${INTERACTIVE:-false} # docker run -i
# tty: ${INTERACTIVE:-false} # docker run -t
@@ -64,38 +75,15 @@ services:
news_check: # Creates a small webapp on http://localhost:8080 to check previously generated pdfs (some of which are unusable and must be marked as such)
build: news_check
image: news_check:latest
user: ${U_ID}:${U_ID} # since the app writes files to the local filesystem, it must be run as the current user
# user: 1001:1001 # since the app writes files to the local filesystem, it must be run as the current user
depends_on:
- db_passthrough
volumes:
- ${CONTAINER_DATA}:/app/containerdata # always set
- ./config/container.yaml:/app/config.yaml
- ${CODE:-/dev/null}:/code # not set in prod, defaults to /dev/null
environment:
- CONFIG_FILE=/app/config.yaml
- UNAME=${UNAME}
ports:
- "8080:80" # 80 inside container
entrypoint: ${ENTRYPOINT:-python app.py} # by default launch workers as defined in the Dockerfile
nas_sync:
image: alpine:latest
volumes:
- ${CONTAINER_DATA}/files:/sync/local_files
- coss_smb_share:/sync/remote_files
command:
- /bin/sh
- -c
- |
apk add rsync
rsync -av --no-perms --no-owner --no-group --progress /sync/local_files/${SYNC_FOLDER}/ /sync/remote_files/${SYNC_FOLDER} -n
volumes:
coss_smb_share:
driver: local
driver_opts:
type: cifs
o: "addr=${NAS_HOST},nounix,file_mode=0777,dir_mode=0777,domain=D,username=${NAS_USERNAME},password=${NAS_PASSWORD}"
device: //${NAS_HOST}${NAS_PATH}
tty: true

View File

@@ -1,9 +1,9 @@
# Runs in a debugging mode, does not launch anything at all but starts a bash process
export CONTAINER_DATA=/mnt/media/@Bulk/COSS/Downloads/coss_archiving
export CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving
export UNAME=remy
export U_ID=1000
export GECKODRIVER_IMG=selenium/standalone-firefox:104.0
export DEBUG=true
export HEADFULL=true
export CODE=./

View File

@@ -1,7 +1,7 @@
# Runs on the main slack channel with the full worker setup. If nothing funky has occurred, reducedfetch is a speedup
CONTAINER_DATA=/mnt/media/@Bulk/COSS/Downloads/coss_archiving
CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving
export UNAME=remy
export U_ID=1000
export GECKODRIVER_IMG=selenium/standalone-firefox:104.0
export DEBUG=false

View File

@@ -1,8 +0,0 @@
if [ -d "/firefox_profile/news_fetch.profile" ]
then
echo "Profile already exists, skipping folder creation"
else
echo "Creating empty folder for profile"
mkdir -p /firefox_profile/news_fetch.profile/
fi
firefox --profile /firefox_profile/news_fetch.profile

46
launch Normal file
View File

@@ -0,0 +1,46 @@
#!/bin/bash
set -e
set -o ignoreeof
echo "Bash script launching COSS_ARCHIVING..."
# CHANGE ME ONCE!
export CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving
export UNAME=remy
# CHANGE ME WHEN UPDATING FIREFOX
export GECKODRIVER_IMG=selenium/standalone-firefox:104.0
# version must be >= the one on the host or firefox will not start (because of mismatched config)
if [[ $1 == "debug" ]]
then
export DEBUG=true
export HEADFULL=true
export CODE=./
export ENTRYPOINT=/bin/bash
# since --service-ports does not open ports on implicitly started containers, also start geckodriver:
docker compose up -d geckodriver
elif [[ $1 == "production" ]]
then
export DEBUG=false
elif [[ $1 == "build" ]]
then
export DEBUG=false
docker compose build
exit 0
elif [[ $1 == "down" ]]
then
docker compose stop
exit 0
else
echo "Please specify the execution mode (debug/production/build) as the first argument"
exit 1
fi
shift # consumes the variable set in $1 so that $@ only contains the remaining arguments
docker compose run -it --service-ports "$@"
echo "Docker run finished, shutting down containers..."
docker compose stop
echo "Bye!"

View File

@@ -1,7 +0,0 @@
### MANUAL TASKS
This directory contains scripts for repetitive but somewhat automatable tasks.
> ⚠️ warning:
>
> Most scripts still require manual intervention before/after running and probably require changes to the code. **Please make sure you understand them before using them!**

View File

@@ -1,21 +0,0 @@
"""
Saves websites specified in 'batch_urls.txt' to the wayback machine. Outputs archive urls to terminal
Hint: use 'python batch_archive.py > batch_archive.txt' to save the output to a file
"""
from waybackpy import WaybackMachineSaveAPI # upload to archive.org
import time
urls = []
with open ("batch_urls.txt", "r") as f:
urls = f.readlines()
for i, url in enumerate(urls):
print(f"Saving url {i+1} / {len(urls)}")
user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0" # needed?
wayback = WaybackMachineSaveAPI(url, user_agent)
archive_url = wayback.save()
print(archive_url)
time.sleep(20)
# Uploads to archive.org are rate limited

View File

@@ -1,18 +0,0 @@
https://id2020.org
https://www.weforum.org/platforms/the-centre-for-cybersecurity
https://www.unhcr.org/blogs/wp-content/uploads/sites/48/2018/04/fs.pdf
https://en.wikipedia.org/wiki/Social_Credit_System
https://en.wikipedia.org/wiki/Customer_lifetime_value
https://www.weforum.org/reports/the-internet-of-bodies-is-here-tackling-new-challenges-of-technology-governance
https://www.un.org/en/about-us/universal-declaration-of-human-rights
https://www.biometricupdate.com/201909/id2020-and-partners-launch-program-to-provide-digital-id-with-vaccines
https://www.wired.com/2008/06/pb-theory/
https://www.medtechdive.com/news/fda-warns-of-false-positives-with-bd-coronavirus-diagnostic/581115/
https://www.bbc.com/news/world-middle-east-52579475
https://www.timesofisrael.com/over-12000-mistakenly-quarantined-by-phone-tracking-health-ministry-admits/
https://www.delftdesignforvalues.nl
https://www.theglobalist.com/technology-big-data-artificial-intelligence-future-peace-rooms/
https://link.springer.com/chapter/10.1007/978-3-319-90869-4_17
https://www.youtube.com/watch?v=_KhAsJRk2lo
https://www.bloomberg.org/environment/supporting-sustainable-cities/american-cities-climate-challenge/
https://climatecitycup.org

View File

@@ -1,33 +0,0 @@
"""
Saves youtube videos specified in 'batch_urls.txt' to the local folder. (to be copied manually)
"""
import youtube_dl
urls = []
with open ("batch_urls.txt", "r") as f:
urls = f.readlines()
def post_download_hook(ret_code):
if ret_code['status'] == 'finished':
file_loc = ret_code["filename"]
print(file_loc)
def save_video(url):
"""Saves video accoring to url and save path"""
ydl_opts = {
'format': 'best[height<=720]',
'progress_hooks': [post_download_hook],
'updatetime': False
}
try:
with youtube_dl.YoutubeDL(ydl_opts) as ydl:
ydl.download([url])
except Exception as e:
print(f"Youtube download crashed: {e}")
for i, url in enumerate(urls):
print(f"Downloading video {i+1} / {len(urls)}")
save_video(url)

View File

@@ -1,70 +0,0 @@
"""
Runs the news_fetch pipeline against a manually curated list of urls and saves them locally
"""
import sys
sys.path.append("../news_fetch")
import runner
import os
import logging
logger = logging.getLogger()
class DummyMessage:
"""Required by the dispatcher"""
ts = 0
def __init__(self, url):
self.urls = [url]
def fetch():
dispatcher = runner.Dispatcher()
dispatcher.workers_in = [
{"FetchWorker": runner.FetchWorker(), "DownloadWorker": runner.DownloadWorker()},
{"UploadWorker": runner.UploadWorker()}
]
print_worker = runner.PrintWorker("Finished processing", sent = True)
dispatcher.workers_out = [{"PrintWorker": print_worker}]
dispatcher.start()
with open("media_urls.txt", "r") as f:
url_list = [l.replace("\n", "") for l in f.readlines()]
with open("media_urls.txt", "w") as f:
f.write("") # empty the file once it is read so that it does not get processed again
if url_list:
logger.info(f"Found {len(url_list)} media urls")
for u in url_list:
dispatcher.incoming_request(DummyMessage(u))
else:
logger.info(f"No additional media urls found. Running the pipeline with messages from db.")
print_worker.keep_alive()
def show():
for a in runner.models.ArticleDownload.select():
print(f"""
URL: {a.article_url}
ARCHIVE_URL: {a.archive_url}
ARTICLE_SOURCE: {a.source_name}
FILE_NAME: {a.file_name}
""")
if __name__ == "__main__":
logger.info("Overwriting production values for single time media-fetch")
if not os.path.exists("../.dev/"):
os.mkdir("../.dev/")
runner.configuration.models.set_db(
runner.configuration.SqliteDatabase("../.dev/media_downloads.db")
)
runner.configuration.main_config["downloads"]["local_storage_path"] = "../.dev/"
if len(sys.argv) == 1: # no additional arguments
fetch()
elif sys.argv[1] == "show":
show()

View File

View File

@@ -1,6 +1,3 @@
"""
Extracts all urls from a list of mails exported from thunderbird. Writes to 'mails_url_export.json'
"""
import os
import re
import json
@@ -22,5 +19,5 @@ for f in all_files:
print("Saved {} urls".format(len(all_urls)))
with open("mails_url_export.json", "w") as f:
with open("media_mails_export.json", "w") as f:
json.dump(all_urls, f)

View File

@@ -0,0 +1,72 @@
import sys
sys.path.append("../app")
import runner
import logging
logger = logging.getLogger()
import json
from rich.console import Console
from rich.table import Table
console = Console()
logger.info("Overwriting production values for single time media-fetch")
runner.configuration.models.set_db(
runner.configuration.SqliteDatabase("../.dev/media_message_dummy.db"), # chat_db (not needed here)
runner.configuration.SqliteDatabase("../.dev/media_downloads.db")
)
runner.configuration.main_config["DOWNLOADS"]["local_storage_path"] = "../.dev/"
def fetch():
coordinator = runner.Coordinator()
kwargs = {
"worker_download" : runner.DownloadWorker(),
"worker_fetch" : runner.FetchWorker(),
"worker_upload" : runner.UploadWorker(),
"worker_compress" : runner.CompressWorker(),
}
coordinator.add_workers(**kwargs)
coordinator.start()
with open("media_urls.json", "r") as f:
url_list = json.loads(f.read())
logger.info(f"Found {len(url_list)} media urls")
for u in url_list:
msg_text = f"<{u}|dummy preview text>"
dummy_thread = runner.models.Thread()
msg = runner.models.Message(text= msg_text, thread=dummy_thread)
coordinator.incoming_request(msg)
def show():
t = Table(
title = "ArticleDownloads",
row_styles = ["white", "bright_black"],
)
entries = ["title", "article_url", "archive_url", "authors"]
for e in entries:
t.add_column(e, justify = "right")
sel = runner.models.ArticleDownload.select()
for s in sel:
c = [getattr(s, e) for e in entries]
c[-1] = str([a.author for a in c[-1]])
print(c)
t.add_row(*c)
console.print(t)
# fetch()
show()

View File

@@ -0,0 +1,88 @@
import time
import keys
import slack_sdk
from slack_sdk.errors import SlackApiError
from peewee import SqliteDatabase
from persistence import message_models
# from bot_utils import messages
# Constant values...
MESSAGES_DB = "/app/containerdata/messages.db"
BOT_ID = "U02MR1R8UJH"
ARCHIVE_ID = "C02MM7YG1V4"
DEBUG_ID = "C02NM2H9J5Q"
client = slack_sdk.WebClient(token=keys.OAUTH_TOKEN)
message_models.set_db(SqliteDatabase(MESSAGES_DB))
def message_dict_to_model(message):
if message["type"] == "message":
thread_ts = message["thread_ts"] if "thread_ts" in message else message["ts"]
uid = message.get("user", "BAD USER")
user, _ = message_models.User.get_or_create(user_id = uid)
thread, _ = message_models.Thread.get_or_create(thread_ts = thread_ts)
m, new = message_models.Message.get_or_create(
user = user,
thread = thread,
ts = message["ts"],
channel_id = ARCHIVE_ID,
text = message["text"]
)
print("Saved (text) {} (new={})".format(m, new))
for f in message.get("files", []): #default: []
m.file_type = f["filetype"]
m.perma_link = f["url_private_download"]
m.save()
print("Saved permalink {} to {} (possibly overwriting)".format(f["name"], m))
if new:
return m
else:
return None
else:
print("What should I do of {}".format(message))
return None
def check_all_past_messages():
last_ts = 0
result = client.conversations_history(
channel=ARCHIVE_ID,
oldest=last_ts
)
new_messages = result.get("messages", []) # fetches 100 messages by default
new_fetches = []
for m in new_messages:
new_fetches.append(message_dict_to_model(m))
# print(result)
refetch = result.get("has_more", False)
print(f"Refetching : {refetch}")
while refetch: # we have not actually fetched them all
try:
result = client.conversations_history(
channel = ARCHIVE_ID,
cursor = result["response_metadata"]["next_cursor"],
oldest = last_ts
) # refetches in batches of 100 messages
refetch = result.get("has_more", False)
new_messages = result.get("messages", [])
for m in new_messages:
new_fetches.append(message_dict_to_model(m))
except SlackApiError: # Most likely a rate-limit
print("Error while fetching channel messages. (likely rate limit) Retrying in {} seconds...".format(30))
time.sleep(30)
refetch = True
check_all_past_messages()

38
misc/hotfix_reactions.py Normal file
View File

@@ -0,0 +1,38 @@
from peewee import SqliteDatabase
from persistence import article_models, message_models
# Global logger setup:
# Constant values...
DOWNLOADS_DB = "../container_data/downloads.db"
MESSAGES_DB = "../container_data/messages.db"
BOT_ID = "U02MR1R8UJH"
ARCHIVE_ID = "C02MM7YG1V4"
DEBUG_ID = "C02NM2H9J5Q"
# DB Setup:
article_models.set_db(SqliteDatabase(
DOWNLOADS_DB,
pragmas = {'journal_mode': 'wal'} # multiple threads can access at once
))
message_models.set_db(SqliteDatabase(MESSAGES_DB))
for reaction in message_models.Reaction.select():
print(reaction)
thread = reaction.message.thread
articles = message_models.get_referenced_articles(thread, article_models.ArticleDownload)
for a in articles:
print(a)
reaction = reaction.type
status = 1 if reaction == "white_check_mark" else -1
print(status)
for article in articles:
article.verified = status
article.save()

151
misc/media_urls.json Normal file
View File

@@ -0,0 +1,151 @@
[
"https://www.swissinfo.ch/ger/wirtschaft/koennen-ki-und-direkte-demokratie-nebeneinander-bestehen-/47542048",
"https://www.zeit.de/2011/33/CH-Oekonophysik",
"https://ourworld.unu.edu/en/green-idea-self-organizing-traffic-signals",
"https://www.youtube.com/watch?v=-FQD4ie9UYA",
"https://www.brandeins.de/corporate-services/mck-wissen/mck-wissen-logistik/schwaermen-fuer-das-optimum",
"https://www.youtube.com/watch?v=upQM4Xzh8zM",
"https://www.youtube.com/watch?v=gAkoprZmW4k",
"https://www.youtube.com/watch?v=VMzfDVAWXHI&t=1s",
"https://www.youtube.com/watch?v=1SwTiIlkndE",
"https://www.informatik-aktuell.de/management-und-recht/digitalisierung/digitale-revolution-und-oekonomie-40-quo-vadis.html",
"https://www.youtube.com/watch?v=cSvvH0SBFOw",
"https://www.linkedin.com/posts/margit-osterloh-24198a104_pl%C3%A4doyer-gegen-sprechverbote-ugcPost-6925702100450480129-K7Dl?utm_source=linkedin_share&utm_medium=member_desktop_web",
"https://www.nebelspalter.ch/plaedoyer-gegen-sprechverbote",
"https://falling-walls.com/people/dirk-helbing/",
"https://digitalsensemaker.podigee.io/3-2-mit-dirk-helbing",
"https://www.blick.ch/wirtschaft/musk-als-hueter-der-redefreiheit-eth-experte-sagt-musks-vorhaben-hat-potenzial-aber-id17437811.html",
"https://www.trend.at/standpunkte/mit-verantwortung-zukunft-10082300",
"https://www.pantarhei.ch/podcast/",
"https://ethz.ch/en/industry/industry/news/data/2022/04/intelligent-traffic-lights-for-optimal-traffic-flow.html",
"https://ethz.ch/de/wirtschaft/industry/news/data/2022/04/optimaler-verkehrsfluss-mit-intelligenten-ampeln.html",
"https://www.spektrum.de/news/die-verschlungenen-wege-der-menschen/1181815",
"https://www.pcwelt.de/a/diktatur-4-0-schoene-neue-digitalisierte-welt,3447005",
"https://www.nzz.ch/english/cancel-culture-at-eth-a-professor-receives-death-threats-over-a-lecture-slide-ld.1675322",
"https://www.brandeins.de/corporate-services/mck-wissen/mck-wissen-logistik/schwaermen-fuer-das-optimum",
"https://www.achgut.com/artikel/ausgestossene_der_woche_prinz_william_als_immaginierter_rassist",
"https://www.pinterpolitik.com/in-depth/klaim-big-data-luhut-perlu-diuji/",
"https://www.srf.ch/kultur/gesellschaft-religion/eklat-an-der-eth-wenn-ein-angeblicher-schweinevergleich-zur-staatsaffaere-wird",
"https://open.spotify.com/episode/6s1icdoplZeNOINvx6ZHTd?si=610a699eba004da2&nd=1",
"https://www.nzz.ch/schweiz/shitstorm-an-der-eth-ein-professor-erhaelt-morddrohungen-ld.1673554",
"https://www.nzz.ch/schweiz/shitstorm-an-der-eth-ein-professor-erhaelt-morddrohungen-ld.1673554",
"https://djmag.com/features/after-astroworld-what-being-done-stop-crowd-crushes-happening-again",
"https://prisma-hsg.ch/articles/meine-daten-deine-daten-unsere-daten/",
"https://www.srf.ch/audio/focus/zukunftsforscher-dirk-helbing-die-welt-ist-keine-maschine?id=10756661",
"https://www.20min.ch/story/roboter-fuer-hunde-machen-wenig-sinn-647302764916",
"https://www.wienerzeitung.at/nachrichten/wissen/mensch/942890-Roboter-als-Praesidentschaftskandidaten.html",
"https://disruptors.fm/11-building-a-crystal-ball-of-the-world-unseating-capitalism-and-creating-a-new-world-order-with-prof-dirk-helbing/",
"https://www.spreaker.com/user/disruptorsfm/11-building-crystal-ball-of-the-world-un",
"https://www.youtube.com/watch?v=fRkCMC3zqSQ",
"https://arstechnica.com/science/2021/11/what-the-physics-of-crowds-can-tell-us-about-the-tragic-deaths-at-astroworld/",
"https://www.fox23.com/news/trending/astroworld-festival-big-crowds-can-flow-like-liquid-with-terrifying-results/37QH6Q4RGFELHGCZSZTBV46STU/",
"https://futurism.com/astroworld-theory-deaths-bodies-fluid",
"https://www.businessinsider.com/why-people-died-astroworld-crowd-crush-physics-fluid-dynamics-2021-11",
"https://theconversation.com/ten-tips-for-surviving-a-crowd-crush-112169",
"https://www.limmattalerzeitung.ch/basel/das-wort-zum-tag-kopie-von-4-januar-hypotenuse-schlaegt-kathete-trivia-trampel-pandemie-ld.2233931",
"https://magazine.swissinformatics.org/en/whats-wrong-with-ai/",
"https://magazine.swissinformatics.org/en/whats-wrong-with-ai/",
"https://www.netkwesties.nl/1541/wrr-ai-wordt-de-verbrandingsmotor-van.htm",
"https://youtu.be/ptm9zLG2KaE",
"https://www.deutschlandfunkkultur.de/die-zukunft-der-demokratie-mehr-teilhabe-von-unten-wagen.976.de.html?dram:article_id=468341",
"https://www.springer.com/gp/book/9783642240034",
"https://www.springer.com/de/book/9783319908687",
"https://technikjournal.de/2017/08/02/ein-plaedoyer-fuer-die-digitale-demokratie/",
"https://technikjournal.de/2017/08/02/ein-plaedoyer-fuer-die-digitale-demokratie/",
"https://trafo.hypotheses.org/23989",
"https://web.archive.org/web/20200609053329/https://www.wiko-berlin.de/institution/projekte-kooperationen/projekte/working-futures/wiko-briefs-working-futures-in-corona-times/the-corona-crisis-reveals-the-struggle-for-a-sustainable-digital-future/",
"https://www.wiko-berlin.de/institution/projekte-kooperationen/projekte/working-futures/wiko-briefs-working-futures-in-corona-times/",
"https://www.youtube.com/watch?v=gAkoprZmW4k",
"https://www.rhein-zeitung.de/region/aus-den-lokalredaktionen/nahe-zeitung_artikel,-peter-flaschels-lebenswerk-hat-die-sozialgeschichte-beeinflusst-_arid,2322161.html",
"https://www.blick.ch/wirtschaft/online-boom-ohne-ende-corona-befeuert-die-tech-revolution-id16359910.html",
"https://www.nzz.ch/meinung/china-unterwirft-tech-und-social-media-das-geht-auch-europa-an-ld.1643010",
"https://www.say.media/article/la-mort-par-algorithme",
"https://www.suedostschweiz.ch/aus-dem-leben/2021-08-14/stau-ist-nicht-gleich-stau",
"https://www.swissinfo.ch/eng/directdemocracy/political-perspectives_digital-democracy--too-risky--or-the-chance-of-a-generation-/43836222",
"https://kow-berlin.com/exhibitions/illusion-einer-menschenmenge",
"https://www.springer.com/gp/book/9783642240034",
"https://www.springer.com/de/book/9783319908687",
"https://www.politik-kommunikation.de/ressorts/artikel/eine-gefaehrliche-machtasymmetrie-1383558602",
"https://www.springer.com/gp/book/9783642240034",
"https://www.springer.com/de/book/9783319908687",
"https://solutions.hamburg/ethik-und-digitalisierung-nicht-voneinander-getrennt-betrachten/",
"https://www.springer.com/gp/book/9783642240034",
"https://www.springer.com/de/book/9783319908687",
"https://avenue.argusdatainsights.ch/Article/AvenueClip?artikelHash=d14d91ec9a8b4cb0b6bb3012c0cefd8b_27F0B19422F1F03723769C18906AA1EE&artikelDateiId=298862327",
"https://www.tagblatt.ch/kultur/grosses-ranking-ihre-stimme-hat-gewicht-das-sind-die-50-profiliertesten-intellektuellen-der-schweiz-ld.2182261",
"https://reliefweb.int/report/world/building-multisystemic-understanding-societal-resilience-covid-19-pandemic",
"https://reliefweb.int/report/world/building-multisystemic-understanding-societal-resilience-covid-19-pandemic",
"https://www.events.at/e/wie-wir-in-zukunft-leben-wollen-die-stadt-als-datenfeld",
"https://www.events.at/e/wie-wir-in-zukunft-leben-wollen-die-stadt-als-datenfeld",
"https://greennetproject.org/en/2018/11/27/prof-dirk-helbing-es-braucht-vor-allem-tolle-ideen-in-die-sich-die-leute-verlieben/",
"https://www.hpcwire.com/2011/05/06/simulating_society_at_the_global_scale/",
"https://www.technologyreview.com/2010/04/30/204005/europes-plan-to-simulate-the-entire-planet/",
"https://komentare.sme.sk/c/22543617/smrt-podla-algoritmu.html",
"https://komentare.sme.sk/c/22543617/smrt-podla-algoritmu.html",
"https://www.confidencial.com.ni/opinion/muerte-por-algoritmo/",
"https://www.nzz.ch/panorama/wie-kann-eine-massenpanik-verhindert-werden-ld.1614761",
"https://www.20min.ch/story/roboter-fuer-hunde-machen-wenig-sinn-647302764916",
"https://www.wienerzeitung.at/nachrichten/wissen/mensch/942890-Roboter-als-Praesidentschaftskandidaten.html",
"https://www.srf.ch/audio/focus/zukunftsforscher-dirk-helbing-die-welt-ist-keine-maschine?id=10756661",
"https://disruptors.fm/11-building-a-crystal-ball-of-the-world-unseating-capitalism-and-creating-a-new-world-order-with-prof-dirk-helbing/",
"https://www.spreaker.com/user/disruptorsfm/11-building-crystal-ball-of-the-world-un",
"https://www.youtube.com/watch?v=fRkCMC3zqSQ",
"https://arstechnica.com/science/2021/11/what-the-physics-of-crowds-can-tell-us-about-the-tragic-deaths-at-astroworld/",
"https://www.fox23.com/news/trending/astroworld-festival-big-crowds-can-flow-like-liquid-with-terrifying-results/37QH6Q4RGFELHGCZSZTBV46STU/",
"https://futurism.com/astroworld-theory-deaths-bodies-fluid",
"https://www.businessinsider.com/why-people-died-astroworld-crowd-crush-physics-fluid-dynamics-2021-11",
"https://theconversation.com/ten-tips-for-surviving-a-crowd-crush-112169",
"https://www.limmattalerzeitung.ch/basel/das-wort-zum-tag-kopie-von-4-januar-hypotenuse-schlaegt-kathete-trivia-trampel-pandemie-ld.2233931",
"https://www.pantarhei.ch/podcast/",
"https://www.focus.it/scienza/scienze/folla-fisica-modelli-simulazioni",
"https://www.focus.it/scienza/scienze/folla-fisica-modelli-simulazioni",
"https://www.netkwesties.nl/1541/wrr-ai-wordt-de-verbrandingsmotor-van.htm",
"https://www.transformationbeats.com/de/transformation/digitale-gesellschaft/",
"https://www.transformationbeats.com/de/transformation/digitale-gesellschaft/",
"https://www.suedkurier.de/ueberregional/wirtschaft/Wie-uns-der-Staat-heimlich-erzieht-sogar-auf-dem-Klo;art416,8763904",
"https://www.suedkurier.de/ueberregional/wirtschaft/Wie-uns-der-Staat-heimlich-erzieht-sogar-auf-dem-Klo;art416,8763904",
"https://www.deutschlandfunkkultur.de/die-zukunft-der-demokratie-mehr-teilhabe-von-unten-wagen.976.de.html?dram:article_id=468341",
"https://www.springer.com/gp/book/9783642240034",
"https://www.springer.com/de/book/9783319908687",
"https://trafo.hypotheses.org/23989",
"https://web.archive.org/web/20200609053329/https://www.wiko-berlin.de/institution/projekte-kooperationen/projekte/working-futures/wiko-briefs-working-futures-in-corona-times/the-corona-crisis-reveals-the-struggle-for-a-sustainable-digital-future/",
"https://www.wiko-berlin.de/institution/projekte-kooperationen/projekte/working-futures/wiko-briefs-working-futures-in-corona-times/",
"https://www.youtube.com/watch?v=gAkoprZmW4k",
"https://futurium.de/de/gespraech/ranga-yogeshwar-1/ranga-yogeshwar-dirk-helbing-mit-musik-von-till-broenner",
"https://www.springer.com/gp/book/9783642240034",
"https://www.springer.com/de/book/9783319908687",
"https://idw-online.de/en/news113518",
"https://blmplus.de/die-digitalcharta-ist-erst-der-anfang-ein-szenario-von-dirk-helbing/",
"https://www.risiko-dialog.ch/big-nudging-vom-computer-gelenkt-aber-wohin/",
"https://idw-online.de/de/news13986",
"https://www.uni-stuttgart.de/presse/archiv/uni-kurier/uk84_85/forschung/fw66.html",
"https://www.infosperber.ch/medien/trends/rankings-oft-unbrauchbar-so-oder-so-aber-immer-schadlich/",
"https://www.infosperber.ch/medien/trends/rankings-oft-unbrauchbar-so-oder-so-aber-immer-schadlich/",
"https://www.nzz.ch/meinung/china-unterwirft-tech-und-social-media-das-geht-auch-europa-an-ld.1643010",
"https://www.suedostschweiz.ch/aus-dem-leben/2021-08-14/stau-ist-nicht-gleich-stau",
"https://www.swissinfo.ch/eng/directdemocracy/political-perspectives_digital-democracy--too-risky--or-the-chance-of-a-generation-/43836222",
"https://werteundwandel.de/inhalte/d2030-in-aufbruchstimmung-fuer-eine-lebenswerte-zukunft/",
"https://www.springer.com/gp/book/9783642240034",
"https://www.springer.com/de/book/9783319908687",
"https://www.youtube.com/watch?v=n9e77iYZPEY",
"https://greennetproject.org/en/2018/11/27/prof-dirk-helbing-es-braucht-vor-allem-tolle-ideen-in-die-sich-die-leute-verlieben/",
"https://www.hpcwire.com/2011/05/06/simulating_society_at_the_global_scale/",
"https://www.say.media/article/la-mort-par-algorithme",
"https://www.confidencial.com.ni/opinion/muerte-por-algoritmo/",
"https://www.nzz.ch/panorama/wie-kann-eine-massenpanik-verhindert-werden-ld.1614761",
"https://www.nesta.org.uk/report/digital-democracy-the-tools-transforming-political-engagement/",
"https://www.nature.com/articles/news.2010.351",
"https://www.focus.de/panorama/welt/tid-19265/gastkommentar-nutzt-die-moeglichkeiten-des-computers_aid_534372.html",
"https://www.theglobalist.com/democracy-technology-innovation-society-internet/",
"https://www.theglobalist.com/capitalism-democracy-technology-surveillance-privacy/",
"https://www.theglobalist.com/google-artificial-intelligence-big-data-technology-future/",
"https://www.theglobalist.com/fascism-big-data-artificial-intelligence-surveillance-democracy/",
"https://www.theglobalist.com/technology-big-data-artificial-intelligence-future-peace-rooms/",
"https://www.theglobalist.com/technology-society-sustainability-future-humanity/",
"https://www.theglobalist.com/society-technology-peace-sustainability/",
"https://www.theglobalist.com/democracy-technology-social-media-artificial-intelligence/",
"https://www.theglobalist.com/financial-system-reform-economy-internet-of-things-capitalism/",
"https://www.theglobalist.com/capitalism-society-equality-sustainability-crowd-funding/",
"https://www.theglobalist.com/united-nations-world-government-peace-sustainability-society/",
"https://www.theglobalist.com/world-economy-sustainability-environment-society/"
]

View File

@@ -0,0 +1,3 @@
user=****
domain=D
password=**************

View File

@@ -0,0 +1,12 @@
settings {
logfile = "/tmp/lsyncd.log",
statusFile = "/tmp/lsyncd.status",
nodaemon = true,
}
sync {
default.rsync,
source = "/sync/local_files",
target = "/sync/remote_files",
init = false,
}

View File

@@ -0,0 +1,33 @@
[MAIL]
smtp_server: smtp.******
port: 587
sender: **************
recipient: **************
uname: **************
password: **************+
[SLACK]
bot_id: U02MR1R8UJH
responsible_id: U01AC9ZEN2G
archive_id: C02MM7YG1V4
debug_id: C02NM2H9J5Q
api_wait_time: 90
auth_token: xoxb-**************************************************
app_token: xapp-1-**************************************************
[DATABASE]
download_db_name: downloads.db
chat_db_name: messages.db
db_path_prod: /app/containerdata
db_path_dev: /code/.dev
db_backup: /app/containerdata/backups
[DOWNLOADS]
local_storage_path: /app/containerdata/files
default_download_path: /app/containerdata/tmp
remote_storage_path: /**********
browser_profile_path: /app/containerdata/dependencies/<profile name>
blacklisted_href_domains: ["google.", "facebook."]

View File

@@ -0,0 +1,4 @@
OPENCONNECT_URL=sslvpn.ethz.ch/student-net
OPENCONNECT_USER=***************
OPENCONNECT_PASSWORD=**************
OPENCONNECT_OPTIONS=--authgroup student-net

9
nas_sync/Dockerfile Normal file
View File

@@ -0,0 +1,9 @@
FROM bash:latest
# alpine with bash instead of sh
ENV TZ=Europe/Berlin
RUN apk add lsyncd cifs-utils rsync
RUN mkdir -p /sync/remote_files
COPY entrypoint.sh /sync/entrypoint.sh
ENTRYPOINT ["bash", "/sync/entrypoint.sh"]

10
nas_sync/entrypoint.sh Normal file
View File

@@ -0,0 +1,10 @@
#!/bin/bash
set -e
sleep 5 # waits for the vpn to have an established connection
echo "Starting NAS sync"
mount -t cifs "//$1" -o credentials=/sync/nas_login.config /sync/remote_files
echo "Successfully mounted SAMBA remote: $1 --> /sync/remote_files"
shift # consumes the variable set in $1 so that $@ only contains the remaining arguments
exec "$@"

Binary file not shown.


View File

@@ -0,0 +1,63 @@
html, body {
position: relative;
width: 100%;
height: 100%;
}
body {
color: #333;
margin: 0;
padding: 8px;
box-sizing: border-box;
font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen-Sans, Ubuntu, Cantarell, "Helvetica Neue", sans-serif;
}
a {
color: rgb(0,100,200);
text-decoration: none;
}
a:hover {
text-decoration: underline;
}
a:visited {
color: rgb(0,80,160);
}
label {
display: block;
}
input, button, select, textarea {
font-family: inherit;
font-size: inherit;
-webkit-padding: 0.4em 0;
padding: 0.4em;
margin: 0 0 0.5em 0;
box-sizing: border-box;
border: 1px solid #ccc;
border-radius: 2px;
}
input:disabled {
color: #ccc;
}
button {
color: #333;
background-color: #f4f4f4;
outline: none;
}
button:disabled {
color: #999;
}
button:not(:disabled):active {
background-color: #ddd;
}
button:focus {
border-color: #666;
}

View File

@@ -3,14 +3,9 @@
import ArticleStatus from './ArticleStatus.svelte';
import ArticleOperations from './ArticleOperations.svelte';
import Toast from './Toast.svelte';
let current_id = 0;
let interfaceState = updateInterface()
async function updateInterface () {
const updateInterface = (async () => {
let url = '';
if (current_id == 0) {
url = '/api/article/first';
@@ -24,14 +19,12 @@
const article_response = await fetch(article_url);
const article_data = await article_response.json();
return article_data;
}
})()
function triggerUpdate () {
interfaceState = updateInterface();
}
</script>
{#await interfaceState}
{#await updateInterface}
...
{:then article_data}
<div class="flex w-full h-screen gap-5 p-5">
@@ -40,9 +33,7 @@
<div class="w-2/5">
<ArticleStatus article_data={article_data}/>
<div class="divider divider-vertical"></div>
<ArticleOperations article_data={article_data} callback={triggerUpdate}/>
<ArticleOperations article_data={article_data}/>
</div>
</div>
{/await}
<Toast/>

View File

@@ -1,25 +1,21 @@
<script>
import {fade} from 'svelte/transition';
export let article_data;
export let callback;
window.focus()
import { addToast } from './Toast.svelte';
const actions = [
{name: 'Mark as good (and skip to next)', kbd: 'A'},
{name: 'Mark as bad (and skip to next)', kbd: 'B'},
{name: 'Upload related file', kbd: 'R', comment: "can be used multiple times"},
{name: 'Skip', kbd: 'S'},
{name: 'Upload related file', kbd: 'R'},
{name: 'Skip', kbd: 'ctrl'},
]
let fileInput = document.createElement('input');
fileInput.type = 'file';
fileInput.onchange = e => {
let result = (async () => {
uploadRelatedFile(e.target.files[0]);
})()
const toast_states = {
'success' : {class: 'alert-success', text: 'Article updated successfully'},
'error' : {class: 'alert-error', text: 'Article update failed'},
}
let toast_state = {};
let toast_visible = false;
function onKeyDown(e) {apiAction(e.key)}
@@ -27,16 +23,6 @@
if (actions.map(d => d.kbd.toLowerCase()).includes(key.toLowerCase())){ // ignore other keypresses
const updateArticle = (async() => {
let success
if (key.toLowerCase() == "s") {
addToast('success', "Article skipped")
callback()
return
} else if (key.toLowerCase() == "r") {
fileInput.click() // this will trigger a change in fileInput,
return
} else {
const response = await fetch('/api/article/' + article_data.id + '/set', {
method: 'POST',
headers: {'Content-Type': 'application/json'},
@@ -44,42 +30,26 @@
'action': key.toLowerCase(),
})
})
success = response.status == 200
}
const success = response.status == 200;
if (success){
addToast('success')
callback()
showToast('success');
} else {
addToast('error')
showToast('error');
}
})()
}
}
function showToast(state){
toast_visible = true;
toast_state = toast_states[state];
setTimeout(() => {
toast_visible = false;
}, 1000)
async function uploadRelatedFile(file) {
const formData = new FormData()
formData.append('file', file)
const response = await fetch('/api/article/' + article_data.id + '/set', {
method: 'POST',
body : formData,
})
const success = response.status == 200;
if (success){
const data = await response.json()
let fname = data.file_path
addToast('success', "File uploaded as " + fname)
} else {
addToast('error', "File upload failed")
}
return success;
}
</script>
@@ -88,22 +58,21 @@
<h2 class="card-title">Your options: (click on action or use keyboard)</h2>
<div class="overflow-x-auto">
<table class="table w-full table-compact">
<!-- head -->
<thead>
<tr>
<th>Action</th>
<th>Keyboard shortcut</th>
</tr>
</thead>
<tbody>
{#each actions as action}
<tr>
<td><button on:click={() => apiAction(action.kbd)}>{ action.name }</button></td>
<td><kbd class="kbd">
{ action.kbd }</kbd>
{#if action.comment}({ action.comment }){/if}
</td>
<td><kbd class="kbd">{ action.kbd }</kbd></td>
</tr>
{/each}
</tbody>
</table>
@@ -111,9 +80,14 @@
</div>
</div>
<!-- Listen for keypresses -->
<svelte:window on:keydown|preventDefault={onKeyDown} />
{#if toast_visible}
<div class="toast" transition:fade>
<div class="alert { toast_state.class }">
<div>
<span>{ toast_state.text }.</span>
</div>
</div>
</div>
{/if}

View File

@@ -2,24 +2,13 @@
export let article_data;
const status_items = [
{name: 'Title', value: article_data.title},
{name: 'Url', value: article_data.article_url},
{name: 'Source', value: article_data.source_name},
{name: 'Filename', value: article_data.file_name},
{name: 'Location', value: article_data.save_path},
{name: 'Language', value: article_data.language},
{name: 'Authors', value: article_data.authors},
{name: "Related", value: article_data.related},
{name: "Sent", value: article_data.sent},
]
</script>
<style>
td {
overflow-wrap: break-word;
word-wrap: break-word;
word-break: break-word;
}
</style>
<div class="card bg-neutral-300 shadow-xl overflow-x-auto">
<div class="card-body">
<h2 class="card-title">Article overview:</h2>
@@ -34,20 +23,16 @@
{#each status_items as item}
<tr>
<td>{ item.name }</td>
{#if (item.value != "" || status_items.value == false) }
<td class='bg-emerald-200' style="white-space: normal; width:70%">
{#if item.name == "Url"}
<a href="{ item.value }" target="_blank">{ item.value }</a>
<!-- <td>Quality Control Specialist</td> -->
{#if item.value != ""}
<td class='bg-emerald-200' style="white-space: normal; width:70%">{ item.value }</td>
{:else}
{ item.value }
{/if}
</td>
{:else}
<td class='bg-red-200'>not set</td>
<td class='bg-red-200'>{ item.value }</td>
{/if}
</tr>
{/each}
</tbody>
</table>
</div>
</div>

View File

@@ -1,34 +0,0 @@
<script context="module">
import {fade} from 'svelte/transition';
import { writable } from 'svelte/store';
let toasts = writable([])
export function addToast (type, message="") {
if (message == "") {
message = toast_states[type]["text"]
}
toasts.update((all) => [{"class" : toast_states[type]["class"], "text": message}, ...all]);
toasts = toasts;
setTimeout(() => {
toasts.update((all) => all.slice(0, -1));
}, 2000);
}
const toast_states = {
'success' : {class: 'alert-success', text: 'Article updated successfully'},
'error' : {class: 'alert-error', text: 'Article update failed'},
}
</script>
<div class="toast">
{#each $toasts as toast}
<div class="alert { toast.class }" transition:fade>
<div> <span>{ toast.text }.</span> </div>
</div>
{/each}
</div>

View File

@@ -2,6 +2,9 @@ import App from './App.svelte';
const app = new App({
target: document.body,
props: {
name: 'world'
}
});
export default app;

View File

@@ -2,4 +2,3 @@ flask
peewee
markdown
psycopg2
pyyaml

View File

@@ -1,7 +1,5 @@
from flask import Flask, send_from_directory, request
import os
import configuration
models = configuration.models
db = configuration.db
app = Flask(__name__)
@@ -32,13 +30,9 @@ def get_article_by_id(id):
return article.to_dict()
@app.route("/api/article/first")
def get_article_first(min_id=0):
def get_article_first():
with db:
article = models.ArticleDownload.select(models.ArticleDownload.id).where(
(models.ArticleDownload.verified == 0) &
(models.ArticleDownload.id > min_id) &
(models.ArticleDownload.archived_by == os.getenv("UNAME"))
).order_by(models.ArticleDownload.id).first()
article = models.ArticleDownload.select(models.ArticleDownload.id).where(models.ArticleDownload.verified == 0).order_by(models.ArticleDownload.id).first()
return {"id" : article.id}
@app.route("/api/article/<int:id>/next")
@@ -47,47 +41,27 @@ def get_article_next(id):
if models.ArticleDownload.get_by_id(id + 1).verified == 0:
return {"id" : id + 1}
else:
return get_article_first(min_id=id) # if the current article was skipped, but the +1 is already verified, get_first will return the same article again. so specify min id.
return get_article_first()
@app.route("/api/article/<int:id>/set", methods=['POST'])
def set_article(id):
json = request.get_json(silent=True) # do not raise 400 if there is no json!
# no json usually means a file was uploaded
if json is None:
print("Detected likely file upload.")
action = None
else:
action = request.json.get('action', None) # action inside the json might still be empty
action = request.json['action']
with db:
article = models.ArticleDownload.get_by_id(id)
if action:
if action == "a":
article.verified = 1
elif action == "b":
article.verified = -1
else: # implicitly action == "r":
# request.files is an immutable dict
file = request.files.get("file", None)
if file is None: # upload tends to crash
return "No file uploaded", 400
artname, _ = os.path.splitext(article.file_name)
fname = f"{artname} -- related_{article.related.count() + 1}.{file.filename.split('.')[-1]}"
fpath = os.path.join(article.save_path, fname)
print(f"Saving file to {fpath}")
file.save(fpath)
article.set_related([fname])
return {"file_path": fpath}
elif action == "r":
article.set_related()
article.save()
return "ok"
if __name__ == "__main__":
debug = os.getenv("DEBUG", "false") == "true"
app.run(host="0.0.0.0", port="80", debug=debug)
app.run(host="0.0.0.0", port="80")

View File

@@ -1,16 +1,15 @@
from peewee import PostgresqlDatabase
import time
import yaml
import os
import configparser
config_location = os.getenv("CONFIG_FILE")
with open(config_location, "r") as f:
config = yaml.safe_load(f)
main_config = configparser.ConfigParser()
main_config.read("/app/containerdata/config/news_fetch.config.ini")
cred = config["database"]
time.sleep(10) # wait for the vpn to connect (can't use a healthcheck because there is no depends_on)
db_config = configparser.ConfigParser()
db_config.read("/app/containerdata/config/db.config.ini")
cred = db_config["DATABASE"]
db = PostgresqlDatabase(
cred["production_db_name"], user=cred["production_user_name"], password=cred["production_password"], host="vpn", port=5432
cred["db_name"], user=cred["user_name"], password=cred["password"], host="vpn", port=5432
)
import models

View File

@@ -6,7 +6,7 @@ import os
import datetime
import configuration
downloads_config = configuration.config["downloads"]
config = configuration.main_config["DOWNLOADS"]
# set the nature of the db at runtime
download_db = DatabaseProxy()
@@ -34,14 +34,14 @@ class ArticleDownload(DownloadBaseModel):
file_name = TextField(default = '')
@property
def save_path(self):
return f"{downloads_config['local_storage_path']}/{self.download_date.year}/{self.download_date.strftime('%B')}/"
return f"{config['local_storage_path']}/{self.download_date.year}/{self.download_date.strftime('%B')}/"
@property
def fname_nas(self, file_name=""):
if self.download_date:
if file_name:
return f"NAS: {downloads_config['remote_storage_path']}/{self.download_date.year}/{self.download_date.strftime('%B')}/{file_name}"
return f"NAS: {config['remote_storage_path']}/{self.download_date.year}/{self.download_date.strftime('%B')}/{file_name}"
else: # return the self. name
return f"NAS: {downloads_config['remote_storage_path']}/{self.download_date.year}/{self.download_date.strftime('%B')}/{self.file_name}"
return f"NAS: {config['remote_storage_path']}/{self.download_date.year}/{self.download_date.strftime('%B')}/{self.file_name}"
else:
return None

View File

@@ -2,10 +2,21 @@ FROM python:latest
ENV TZ Europe/Zurich
RUN mkdir -p /app/news_fetch
RUN apt-get update && apt-get install -y ghostscript
# for compression of pdfs
RUN useradd --create-home --shell /bin/bash --uid 1001 autonews
# id mapped to local user
# home directory needed for pip package installation
RUN export PATH=/home/autonews/.local/bin:$PATH
RUN mkdir -p /app/auto_news
RUN chown -R autonews:autonews /app
USER autonews
COPY requirements.txt /app/requirements.txt
RUN python3 -m pip install -r /app/requirements.txt
COPY . /app/news_fetch
WORKDIR /app/news_fetch
COPY . /app/auto_news
WORKDIR /app/auto_news

View File

@@ -1,7 +1,9 @@
import time
import os
import configparser
import logging
import yaml
import time
import shutil
from datetime import datetime
from peewee import SqliteDatabase, PostgresqlDatabase
from rich.logging import RichHandler
@@ -17,21 +19,22 @@ logger = logging.getLogger(__name__)
# load config file containing constants and secrets
config_location = os.getenv("CONFIG_FILE")
with open(config_location, "r") as f:
config = yaml.safe_load(f)
main_config = configparser.ConfigParser()
main_config.read("/app/containerdata/config/news_fetch.config.ini")
db_config = configparser.ConfigParser()
db_config.read("/app/containerdata/config/db.config.ini")
# DEBUG MODE:
if os.getenv("DEBUG", "false") == "true":
logger.warning("Found 'DEBUG=true', setting up dummy databases")
config["slack"]["archive_id"] = config["slack"]["debug_id"]
config["mail"]["recipient"] = config["mail"]["sender"]
config["downloads"]["local_storage_path"] = config["downloads"]["debug_storage_path"]
main_config["SLACK"]["archive_id"] = main_config["SLACK"]["debug_id"]
main_config["MAIL"]["recipient"] = main_config["MAIL"]["sender"]
main_config["DOWNLOADS"]["local_storage_path"] = main_config["DOWNLOADS"]["debug_storage_path"]
download_db = SqliteDatabase(
config["database"]["debug_db"],
main_config["DATABASE"]["download_db_debug"],
pragmas = {'journal_mode': 'wal'} # multiple threads can read at once
)
@@ -40,9 +43,9 @@ else:
logger.warning("Found 'DEBUG=false' and running on production databases, I hope you know what you're doing...")
time.sleep(10) # wait for the vpn to connect (can't use a healthcheck because there is no depends_on)
cred = config["database"]
cred = db_config["DATABASE"]
download_db = PostgresqlDatabase(
cred["production_db_name"], user=cred["production_user_name"], password=cred["production_password"], host="vpn", port=5432
cred["db_name"], user=cred["user_name"], password=cred["password"], host="vpn", port=5432
)
# TODO Reimplement backup/printout
# logger.info("Backing up databases")
@@ -61,5 +64,5 @@ else:
from utils_storage import models
# Set up the database connection (also creates tables if they don't exist)
# Set up the database
models.set_db(download_db)
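
`models.set_db` binds the runtime-selected database (SQLite in debug mode, PostgreSQL behind the VPN in production) to the `DatabaseProxy` declared in the models module. The helper itself is not shown in this hunk; a minimal sketch of the usual peewee pattern, with an illustrative table list:

```python
# Sketch of the set_db pattern referenced above: the models module declares a
# DatabaseProxy and set_db() binds the concrete database to it at runtime.
from peewee import DatabaseProxy, Model, TextField

download_db = DatabaseProxy()  # placeholder until set_db() is called

class DownloadBaseModel(Model):
    class Meta:
        database = download_db

class ArticleDownload(DownloadBaseModel):
    title = TextField(default="")  # illustrative subset of the real fields

def set_db(database):
    download_db.initialize(database)                      # point the proxy at the real db
    database.create_tables([ArticleDownload], safe=True)  # "creates tables if they don't exist"
```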

View File

@@ -9,5 +9,3 @@ htmldate
markdown
rich
psycopg2
unidecode
pyyaml

View File

@@ -1,5 +1,4 @@
"""Main coordination of other util classes. Handles inbound and outbound calls"""
from time import sleep
import configuration
models = configuration.models
from threading import Thread
@@ -11,7 +10,7 @@ from collections import OrderedDict
from utils_mail import runner as MailRunner
from utils_slack import runner as SlackRunner
from utils_worker.workers import DownloadWorker, FetchWorker, UploadWorker
from utils_worker.workers import CompressWorker, DownloadWorker, FetchWorker, UploadWorker
class ArticleWatcher:
@@ -111,8 +110,7 @@ class Dispatcher(Thread):
logger.error("Dispatcher.incoming_request called with no arguments")
return
if is_new or (article.file_name == "" and article.verified == 0) \
or (not is_new and len(self.workers_in) == 1): # this is for upload
if is_new or (article.file_name == "" and article.verified == 0):
# check for models that were created but then abandoned. This means they have missing information, most importantly no associated file
# this overwrites previously set information, but that should not be too important
ArticleWatcher(
@@ -123,21 +121,17 @@ class Dispatcher(Thread):
else: # manually trigger notification immediately
logger.info(f"Found existing article {article}. Now sending")
self.article_complete_notifier(article)
# def manual_processing(self, articles, workers):
# for w in workers:
# w.start()
class PrintWorker:
def __init__(self, action, sent = False) -> None:
self.action = action
self.sent = sent
def send(self, article):
print(f"{self.action} article {article}")
if self.sent:
article.sent = True
article.save()
def keep_alive(self): # keeps script running, because there is nothing else in the main thread
while True: sleep(1)
# for article in articles:
# notifier = lambda article: logger.info(f"Completed manual actions for {article}")
# ArticleWatcher(article, workers_manual = workers, notifier = notifier) # Article watcher wants a thread to link article to TODO: handle threads as a kwarg
@@ -145,26 +139,25 @@ if __name__ == "__main__":
dispatcher = Dispatcher()
if "upload" in sys.argv:
class PrintWorker:
def send(self, article):
print(f"Uploaded article {article}")
articles = models.ArticleDownload.select().where(models.ArticleDownload.archive_url == "" or models.ArticleDownload.archive_url == "TODO:UPLOAD").execute()
logger.info(f"Launching upload to archive for {len(articles)} articles.")
dispatcher.workers_in = [{"UploadWorker": UploadWorker()}]
print_worker = PrintWorker("Uploaded")
dispatcher.workers_out = [{"PrintWorker": print_worker}]
dispatcher.workers_out = [{"PrintWorker": PrintWorker()}]
dispatcher.start()
for a in articles:
dispatcher.incoming_request(article=a)
print_worker.keep_alive()
else: # launch with full action
try:
slack_runner = SlackRunner.BotRunner(dispatcher.incoming_request)
# All workers are implemented as a threaded queue. But the individual model requires a specific processing order:
# fetch -> download (-> compress) -> complete
# fetch -> download -> compress -> complete
# This is reflected in the following list of workers:
workers_in = [
OrderedDict({"FetchWorker": FetchWorker(), "DownloadWorker": DownloadWorker(), "NotifyRunner": "out"}),
OrderedDict({"FetchWorker": FetchWorker(), "DownloadWorker": DownloadWorker(), "CompressWorker": CompressWorker(), "NotifyRunner": "out"}),
OrderedDict({"UploadWorker": UploadWorker()})
]
# The two dicts are processed independently. First element of first dict is called at the same time as the first element of the second dict
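
To make the comment above concrete: each `OrderedDict` is one processing chain, its workers handle a given article strictly in order, while the chains themselves advance independently of one another. A reduced sketch of that consumption pattern, with trivial stand-in workers:

```python
# Sketch of the ordering described above; StubWorker is a stand-in for
# FetchWorker, DownloadWorker, CompressWorker and UploadWorker.
from collections import OrderedDict
from threading import Thread

class StubWorker:
    def __init__(self, name):
        self.name = name

    def process(self, article):
        print(f"{self.name}: {article}")

chains = [
    OrderedDict({"FetchWorker": StubWorker("fetch"),
                 "DownloadWorker": StubWorker("download"),
                 "CompressWorker": StubWorker("compress")}),
    OrderedDict({"UploadWorker": StubWorker("upload")}),
]

def run_chain(chain, article):
    for worker in chain.values():  # strict order inside one chain
        worker.process(article)

article = "ART [https://example.com]"
threads = [Thread(target=run_chain, args=(chain, article)) for chain in chains]
for t in threads:
    t.start()  # the two chains advance independently of each other
for t in threads:
    t.join()
```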

View File

@@ -0,0 +1,208 @@
from rich.console import Console
from rich.table import Table
from rich.columns import Columns
from rich.rule import Rule
console = Console()
hline = Rule(style="white")
import os
import subprocess
from slack_sdk import WebClient
import configuration
models = configuration.models
u_options = {
"ENTER" : "Accept PDF as is. It gets marked as verified",
"D" : "set languange to DE and set verified",
"E" : "set languange to EN and set verified",
"O" : "set other language (prompted)",
"R" : "set related files (prompted multiple times)",
"B" : "reject and move to folder BAD",
"L" : "leave file as is, do not send reaction"
}
bot_client = WebClient(
token = configuration.main_config["SLACK"]["auth_token"]
)
def file_overview(file_url: str, file_attributes: list, options: dict) -> None:
"""Prints a neat overview of the current article"""
file_table = Table(
title = file_url,
row_styles = ["white", "bright_black"],
min_width = 100
)
file_table.add_column("Attribute", justify = "right", no_wrap = True)
file_table.add_column("Value set by auto_news")
file_table.add_column("Status", justify = "right")
for attr in file_attributes:
file_table.add_row(attr["name"], attr["value"], attr["status"])
option_key = "\n".join([f"[[bold]{k}[/bold]]" for k in options.keys()])
option_action = "\n".join([f"[italic]{k}[/italic]" for k in options.values()])
columns = Columns([option_key, option_action])
console.print(file_table)
console.print("Your options:")
console.print(columns)
def send_reaction_to_slack_thread(article, reaction):
"""Sends the verification status as a reaction to the associated slack thread."""
thread = article.slack_thread
messages = models.Message.select().where(models.Message.text.contains(article.article_url))
# TODO rewrite this shit
if len(messages) > 5:
print("Found more than 5 messages. Aborting reactions...")
return
for m in messages:
if m.is_processed_override:
print("Message already processed. Aborting reactions...")
elif not m.has_single_url:
print("Found thread but won't send reaction because thread has multiple urls")
else:
ts = m.slack_ts
bot_client.reactions_add(
channel=configuration.main_config["SLACK"]["archive_id"],
name=reaction,
timestamp=ts
)
print("Sent reaction to message")
def prompt_language(query):
not_set = True
while not_set:
uin = input("Set language (nation-code, 2 letters) ")
if len(uin) != 2:
print("Bad code, try again")
else:
not_set = False
query.language = uin
query.save()
def prompt_related(query):
file_list = []
finished = False
while not finished:
uin = input("Additional file for article? Type '1' to cancel ")
if uin == "1":
query.set_related(file_list)
finished = True
else:
file_list.append(uin)
def prompt_new_fname(query):
uin = input("New fname? ")
old_fname = query.file_name
query.file_name = uin
query.verified = 1
if old_fname != "":
os.remove(query.save_path + old_fname)
query.save()
def reject_article(article):
article.verified = -1
article.save()
print("Article marked as bad")
# also update the threads to not be monitored anymore
send_reaction_to_slack_thread(article, "x")
def unreject_article(query):
query.verified = 1
query.save()
# os.rename(badpdf, fname)
print("File set to verified")
def accept_article(article, last_accepted):
article.verified = 1
article.save()
print("Article accepted as GOOD")
# also update the threads to not be monitored anymore
send_reaction_to_slack_thread(article, "white_check_mark")
return "" # linked
def verify_unchecked():
query = models.ArticleDownload.select().where(models.ArticleDownload.verified == 0).execute()
last_linked = None
for article in query:
console.print(hline)
core_info = []
for e, name in zip([article.save_path, article.file_name, article.title, article.language], ["Save path", "File name", "Title", "Language"]):
entry = {
"status" : "[red]██[/red]" if (len(e) == 0 or e == -1) else "[green]██[/green]",
"value" : e if len(e) != 0 else "not set",
"name" : name
}
core_info.append(entry)
try:
# close any previously opened windows:
# subprocess.call(["kill", "`pgrep evince`"])
os.system("pkill evince")
# then open a new one
subprocess.Popen(["evince", f"file://{os.path.join(article.save_path, article.file_name)}"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
# suppress evince gtk warnings
except Exception as e:
print(e)
continue
file_overview(
file_url = article.article_url,
file_attributes=core_info,
options = u_options
)
proceed = False
while not proceed:
proceed = False
uin = input("Choice ?").lower()
if uin == "":
last_linked = accept_article(article, last_linked) # last linked accelerates the whole process
proceed = True
elif uin == "d":
article.language = "de"
article.verified = 1
article.save()
proceed = True
elif uin == "e":
article.language = "en"
article.verified = 1
article.save()
proceed = True
elif uin == "o":
prompt_language(article)
elif uin == "r":
prompt_related(article)
elif uin == "b":
reject_article(article)
proceed = True
elif uin == "l":
# do nothing
proceed = True
else:
print("Invalid input")

View File

@@ -7,23 +7,22 @@ import logging
import configuration
logger = logging.getLogger(__name__)
mail_config = configuration.config["mail"]
config = configuration.main_config["MAIL"]
def send(article_model):
mail = MIMEMultipart()
mail['Subject'] = "{} -- {}".format(article_model.source_name, article_model.title)
mail['From'] = mail_config["sender"]
mail['To'] = mail_config["recipient"]
mail['From'] = config["sender"]
mail['To'] = config["recipient"]
try:
msg, files = article_model.mail_info() # this is html
except: # Raised by model if article has no associated file
logger.info("Skipping mail sending")
return
msgs = article_model.mail_info # this is html
msg = [m["reply_text"] for m in msgs]
msg = "\n".join(msg)
content = MIMEText(msg, "html")
mail.attach(content)
files = [m["file_path"] for m in msgs if m["file_path"]]
for path in files:
with open(path, 'rb') as file:
part = MIMEApplication(file.read(), "pdf")
@@ -32,15 +31,10 @@ def send(article_model):
mail.attach(part)
try:
try:
smtp = smtplib.SMTP(mail_config["smtp_server"], mail_config["port"])
except ConnectionRefusedError:
logger.error("Server refused connection. Is this an error on your side?")
return False
smtp = smtplib.SMTP(config["smtp_server"], config["port"])
smtp.starttls()
smtp.login(mail_config["uname"], mail_config["password"])
smtp.sendmail(mail_config["sender"], mail_config["recipient"], mail.as_string())
smtp.login(config["uname"], config["password"])
smtp.sendmail(config["sender"], config["recipient"], mail.as_string())
smtp.quit()
logger.info("Mail successfully sent.")
except smtplib.SMTPException as e:
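
`send` only relies on `source_name`, `title` and the `mail_info` property, which (per the models diff further down) yields a list of `{"reply_text", "file_path"}` dicts. A hedged stand-in object for exercising the function without a database; all values are placeholders:

```python
# Sketch: a stand-in article model for testing send() without the database.
# The shape of mail_info mirrors the models diff; the values are placeholders.
class FakeArticle:
    source_name = "Example Source"
    title = "Example title"

    @property
    def mail_info(self):
        return [
            {"reply_text": "<p><a href=\"https://example.com\">https://example.com</a></p>", "file_path": None},
            {"reply_text": "<p><b>Example title</b></p>", "file_path": "/tmp/example.pdf"},  # placeholder path
        ]

# send(FakeArticle())  # would build the MIME message and contact the configured SMTP server
```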

View File

@@ -7,7 +7,7 @@ import re
import time
import configuration
slack_config = configuration.config["slack"]
config = configuration.main_config["SLACK"]
models = configuration.models
class MessageIsUnwanted(Exception):
@@ -61,7 +61,7 @@ class Message:
@property
def is_by_human(self):
return self.user.user_id != slack_config["bot_id"]
return self.user.user_id != config["bot_id"]
@property
@@ -87,7 +87,7 @@ class BotApp(App):
def say_substitute(self, *args, **kwargs):
self.client.chat_postMessage(
channel=slack_config["archive_id"],
channel=config["archive_id"],
text=" - ".join(args),
**kwargs
)
@@ -101,7 +101,7 @@ class BotApp(App):
last_ts = presaved.slack_ts_full
result = self.client.conversations_history(
channel=slack_config["archive_id"],
channel=config["archive_id"],
oldest=last_ts
)
@@ -116,7 +116,7 @@ class BotApp(App):
while refetch: # we have not actually fetched them all
try:
result = self.client.conversations_history(
channel = slack_config["archive_id"],
channel = config["archive_id"],
cursor = result["response_metadata"]["next_cursor"],
oldest = last_ts
) # fetches 100 messages, older than the [-1](=oldest) element of new_fetches
@@ -126,8 +126,8 @@ class BotApp(App):
for m in new_messages:
return_messages.append(Message(m))
except SlackApiError: # Most likely a rate-limit
self.logger.error("Error while fetching channel messages. (likely rate limit) Retrying in {} seconds...".format(slack_config["api_wait_time"]))
time.sleep(slack_config["api_wait_time"])
self.logger.error("Error while fetching channel messages. (likely rate limit) Retrying in {} seconds...".format(config["api_wait_time"]))
time.sleep(config["api_wait_time"])
refetch = True
self.logger.info(f"Fetched {len(return_messages)} new channel messages.")
@@ -154,10 +154,37 @@ class BotApp(App):
def respond_channel_message(self, article, say=None):
if say is None:
say = self.say_substitute
answers = article.slack_info
if article.slack_ts == 0:
self.logger.error(f"{article} has no slack_ts")
else:
self.logger.info("Skipping slack reply.")
self.logger.info("Skipping slack reply because it is broken")
for a in []:
# for a in answers:
if a["file_path"]:
try:
self.client.files_upload(
channels = config["archive_id"],
initial_comment = f"{a['reply_text']}",
file = a["file_path"],
thread_ts = article.slack_ts_full
)
# status = True
except SlackApiError as e: # upload resulted in an error
say(
"File {} could not be uploaded.".format(a),
thread_ts = article.slack_ts_full
)
# status = False
self.logger.error(f"File upload failed: {e}")
else: # anticipated that there is no file!
say(
f"{a['reply_text']}",
thread_ts = article.slack_ts_full
)
# status = True
def startup_status(self):
@@ -181,7 +208,7 @@ class BotRunner():
"""Stupid encapsulation so that we can apply the slack decorators to the BotApp"""
def __init__(self, callback, *args, **kwargs) -> None:
self.bot_worker = BotApp(callback, token=slack_config["auth_token"])
self.bot_worker = BotApp(callback, token=config["auth_token"])
@self.bot_worker.event(event="message", matchers=[is_message_in_archiving])
def handle_incoming_message(message, say):
@@ -195,7 +222,7 @@ class BotRunner():
def handle_all_other_reactions(event, say):
self.logger.log("Ignoring slack event that isn't a message")
self.handler = SocketModeHandler(self.bot_worker, slack_config["app_token"])
self.handler = SocketModeHandler(self.bot_worker, config["app_token"])
def start(self):
@@ -215,5 +242,5 @@ class BotRunner():
def is_message_in_archiving(message) -> bool:
return message["channel"] == slack_config["archive_id"]
return message["channel"] == config["archive_id"]

View File

@@ -1,11 +1,7 @@
import unidecode
KEEPCHARACTERS = (' ','.','_', '-')
def clear_path_name(path):
path = unidecode.unidecode(path) # remove umlauts, accents and others
path = "".join([c if (c.isalnum() or c in KEEPCHARACTERS) else "_" for c in path]) # remove all non-alphanumeric characters
path = path.rstrip() # remove trailing spaces
return path
keepcharacters = (' ','.','_', '-')
converted = "".join([c if (c.isalnum() or c in keepcharacters) else "_" for c in path]).rstrip()
return converted
def shorten_name(name, offset = 50):
if len(name) > offset:
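
The only difference between the two versions of `clear_path_name` above is the `unidecode` transliteration step; the character filter itself behaves the same either way, and because `str.isalnum()` is Unicode-aware in Python 3, umlauts pass it even without transliteration. A small illustration:

```python
# Illustration of what the unidecode step changes in clear_path_name.
from unidecode import unidecode

KEEPCHARACTERS = (' ', '.', '_', '-')

def _filter(s):
    return "".join(c if (c.isalnum() or c in KEEPCHARACTERS) else "_" for c in s).rstrip()

name = "Zürich: Übersicht (Q3).pdf"
print(_filter(unidecode(name)))  # Zurich_ Ubersicht _Q3_.pdf
print(_filter(name))             # Zürich_ Übersicht _Q3_.pdf
```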

View File

@@ -8,9 +8,8 @@ import configuration
import datetime
from . import helpers
downloads_config = configuration.config["downloads"]
FILE_SIZE_THRESHOLD = 15 * 1024 * 1024 # 15MB
config = configuration.main_config["DOWNLOADS"]
slack_config = configuration.main_config["SLACK"]
# set the nature of the db at runtime
download_db = DatabaseProxy()
@@ -33,8 +32,7 @@ class ArticleDownload(DownloadBaseModel):
def is_title_bad(self): # add incrementally
return "PUR-Abo" in self.title \
or "Redirecting" in self.title \
or "Error while running fetch" in self.title \
or self.title == ""
or "Error while running fetch" in self.title
summary = TextField(default = '')
source_name = CharField(default = '')
@@ -44,14 +42,14 @@ class ArticleDownload(DownloadBaseModel):
file_name = TextField(default = '')
@property
def save_path(self):
return f"{downloads_config['local_storage_path']}/{self.download_date.year}/{self.download_date.strftime('%B')}/"
return f"{config['local_storage_path']}/{self.download_date.year}/{self.download_date.strftime('%B')}/"
@property
def fname_nas(self, file_name=""):
if self.download_date:
if file_name:
return f"NAS: {downloads_config['remote_storage_path']}/{self.download_date.year}/{self.download_date.strftime('%B')}/{file_name}"
return f"NAS: {config['remote_storage_path']}/{self.download_date.year}/{self.download_date.strftime('%B')}/{file_name}"
else: # fall back to self.file_name
return f"NAS: {downloads_config['remote_storage_path']}/{self.download_date.year}/{self.download_date.strftime('%B')}/{self.file_name}"
return f"NAS: {config['remote_storage_path']}/{self.download_date.year}/{self.download_date.strftime('%B')}/{self.file_name}"
else:
return None
@property
@@ -96,38 +94,50 @@ class ArticleDownload(DownloadBaseModel):
desc = f"{self.article_url}"
return f"ART [{desc}]"
def mail_info(self):
summary = "\n> " + "\n> ".join(self.summary.split("\n"))
answer_text = f"[{self.article_url}]({self.article_url})\n\n" # first the url
answer_files = []
# displays the summary in a blockquote
@property
def slack_info(self):
status = [":x: No better version available", ":gear: Verification pending", ":white_check_mark: Verified by human"][self.verified + 1]
content = "\n>" + "\n>".join(self.summary.split("\n"))
file_status, msg = self.file_status()
if not file_status:
return [msg]
try:
self.ensure_file_present()
answer_text += f"*{self.title}*\n{summary}"
answer_files.append(self.save_path + self.file_name)
except Exception as e:
msg = e.args[0]
logger.error(f"Article {self} has file-issues: {msg}")
if "file too big" in msg:
location = f"File too big to send directly. Location on NAS:\n`{self.fname_nas}`"
answer_text += f"*{self.title}*\n{summary}\n{location}"
else: # file not found, or filename not set
raise e
# reraise the exception, so that the caller can handle it
# everything alright: generate real content
# first the base file
if self.file_name[-4:] == ".pdf":
answer = [{ # main reply with the base pdf
"reply_text" : f"*{self.title}*\n{status}\n{content}",
"file_path" : self.save_path + self.file_name
}]
else: # don't upload if the file is too big!
location = f"Not uploaded to slack, but the file will be on the NAS:\n`{self.fname_nas}`"
answer = [{ # main reply with the base pdf
"reply_text" : f"*{self.title}*\n{status}\n{content}\n{location}",
"file_path" : None
}]
# then the related files
if self.related:
rel_text = "Related files on NAS:"
rel_text = ""
for r in self.related:
fname = r.related_file_name
rel_text += f"\n• `{self.fname_nas(fname)}` "
lentry = "\n• `{}` ".format(self.fname_nas(fname))
if fname[-4:] == ".pdf": # this is a manageable file, directly upload
f_ret = self.save_path + fname
answer.append({"reply_text":"", "file_path" : f_ret})
else: # not pdf <=> too large. Don't upload but mention its existence
lentry += "(not uploaded to slack, but the file will be on the NAS)"
answer_text += "\n\n" + rel_text
rel_text += lentry
return markdown.markdown(answer_text), answer_files
if rel_text:
rel_text = answer[0]["reply_text"] = answer[0]["reply_text"] + "\nRelated files:\n" + rel_text
return answer
@property
def mail_info(self):
base = [{"reply_text": f"[{self.article_url}]({self.article_url})\n", "file_path":None}] + self.slack_info
return [{"reply_text": markdown.markdown(m["reply_text"]), "file_path": m["file_path"]} for m in base]
def set_authors(self, authors):
@@ -148,15 +158,17 @@ class ArticleDownload(DownloadBaseModel):
related_file_name = r
)
def ensure_file_present(self):
def file_status(self):
if not self.file_name:
raise Exception("no filename")
logger.error(f"Article {self} has no filename!")
return False, {"reply_text": "Download failed, no file was saved.", "file_path": None}
file_path_abs = self.save_path + self.file_name
if not os.path.exists(file_path_abs):
raise Exception("file not found")
if (os.path.splitext(file_path_abs)[1] != ".pdf") or (os.path.getsize(file_path_abs) > FILE_SIZE_THRESHOLD):
raise Exception("file too big")
logger.error(f"Article {self} has a filename, but the file does not exist at that location!")
return False, {"reply_text": "Can't find file. Either the download failed or the file was moved.", "file_path": None}
return True, {}
class ArticleAuthor(DownloadBaseModel):

View File

@@ -0,0 +1,47 @@
import os
import subprocess
from pathlib import Path
import logging
logger = logging.getLogger(__name__)
import configuration
config = configuration.main_config["DOWNLOADS"]
shrink_sizes = []
def shrink_pdf(article):
article_loc = Path(article.save_path) / article.file_name
initial_size = article_loc.stat().st_size
compressed_tmp = Path(config['default_download_path']) / "compressed.pdf"
if article_loc.suffix != ".pdf": # Path.suffix includes the leading dot
return article # it probably was a youtube video
c = subprocess.run(
[
"gs",
"-sDEVICE=pdfwrite",
"-dPDFSETTINGS=/screen",
"-dNOPAUSE",
"-dBATCH",
f"-sOutputFile={compressed_tmp}",
f"{article_loc}"
],
stdout=subprocess.PIPE, stderr=subprocess.PIPE
)
if c.returncode == 0:
try:
os.replace(compressed_tmp, article_loc)
except OSError as e:
logger.error(f"Compression ran but I could not copy back the file {e}")
final_size = article_loc.stat().st_size
shrink_sizes.append(initial_size - final_size)
logger.info(f"Compression worked. Avg shrinkage: {int(sum(shrink_sizes)/len(shrink_sizes) / 1000)} KB")
else:
logger.error(f"Could not run the compression! {c.stderr.decode()} - {c.stdout.decode()}")
return article
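
The `-dPDFSETTINGS=/screen` preset used above is the most aggressive of Ghostscript's built-in quality presets (roughly 72 dpi images); `/ebook` and `/printer` shrink less but preserve more detail. A hedged sketch of the same invocation with the preset exposed as a parameter:

```python
# Sketch: the Ghostscript call from shrink_pdf with a configurable preset.
# /screen (smallest), /ebook and /printer are standard gs PDFSETTINGS values.
import subprocess

def compress_pdf(src: str, dst: str, preset: str = "/screen") -> bool:
    result = subprocess.run(
        ["gs", "-sDEVICE=pdfwrite", f"-dPDFSETTINGS={preset}",
         "-dNOPAUSE", "-dBATCH", f"-sOutputFile={dst}", src],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
    )
    return result.returncode == 0
```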

View File

@@ -1,144 +1,148 @@
import logging
import time
import datetime
import os, shutil, uuid
from pathlib import Path
import logging
import os
import base64
import requests
from selenium import webdriver
import configuration
import json
download_config = configuration.config["downloads"]
def driver_running(f):
def wrapper(*args, **kwargs):
self = args[0]
if not self._running:
self.start()
return f(*args, **kwargs)
return wrapper
config = configuration.main_config["DOWNLOADS"]
blacklisted = json.loads(config["blacklisted_href_domains"])
class PDFDownloader:
"""Saves a given url. Fills the object it got as a parameter"""
logger = logging.getLogger(__name__)
_running = False
# status-variable for restarting:
running = False
def start(self):
"""Called externally to start the driver, but after an exception can also be called internally"""
if self._running:
self.finish() # clear up
self.logger.info("Starting geckodriver")
reduced_path = self.create_tmp_profile()
profile = webdriver.FirefoxProfile(reduced_path)
options = webdriver.FirefoxOptions()
options.profile = config["browser_profile_path"]
# should be options.set_preference("profile", config["browser_profile_path"]) as of selenium 4 but that doesn't work
if os.getenv("DEBUG", "false") == "true":
self.logger.warning("Opening browser GUI because of 'DEBUG=true'")
else:
options.add_argument('--headless')
options.set_preference('print.save_as_pdf.links.enabled', True)
# Just save if the filetype is pdf already
# TODO: this is not working right now
options.set_preference("print.printer_Mozilla_Save_to_PDF.print_to_file", True)
options.set_preference("browser.download.folderList", 2)
# options.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf")
# options.set_preference("pdfjs.disabled", True)
options.set_preference("browser.download.dir", config["default_download_path"])
self.logger.info("Starting gecko driver")
# previously, in a single docker image:
# self.driver = webdriver.Firefox(
# options = options,
# service = webdriver.firefox.service.Service(
# log_path = f'{config["local_storage_path"]}/geckodriver.log'
# ))
self.driver = webdriver.Remote(
command_executor = 'http://geckodriver:4444', # the host geckodriver points to the geckodriver container
command_executor = 'http://geckodriver:4444',
options = options,
browser_profile = profile
# can't set log path...
)
self._running = True
residues = os.listdir(config["default_download_path"])
for res in residues:
os.remove(os.path.join(config["default_download_path"], res))
self.running = True
def autostart(self):
if not self.running:
self.start() # relaunch the dl util
def finish(self):
self.logger.info("Exiting Geckodriver")
if self.running:
self.logger.info("Exiting gecko driver")
try:
self.driver.quit()
time.sleep(10)
except:
self.logger.critical("Connection to the driver broke off")
self._running = False
self.running = False
else:
self.logger.info("Gecko driver not yet running")
@driver_running
def download(self, article_object):
sleep_time = 2
self.autostart()
url = article_object.article_url
if url[-4:] == ".pdf": # calling the ususal pdf generation would not yield a nice pdf, just download it directly
self.logger.info("Downloading existing pdf")
success = self.get_exisiting_pdf(article_object)
# get a page title if required
if article_object.is_title_bad:
article_object.title = self.driver.title.replace(".pdf", "") # some titles end with .pdf
# will be propagated to the saved file (dst) as well
else:
success = self.get_new_pdf(article_object)
if not success:
self.logger.error("Download failed")
# TODO: need to reset the file name to empty?
return article_object # changes to this are saved later by the external caller
def get_exisiting_pdf(self, article_object):
# get a better page title if required
if article_object.is_title_bad:
article_object.title = article_object.article_url.split("/")[-1].split(".pdf")[0]
try:
r = requests.get(article_object.article_url)
bytes = r.content
except:
return False
return self.write_pdf(bytes, article_object)
def get_new_pdf(self, article_object):
sleep_time = int(download_config["browser_print_delay"])
try:
self.driver.get(article_object.article_url)
self.driver.get(url)
except Exception as e:
self.logger.critical("Selenium .get(url) failed with error {}".format(e))
self.finish()
return False
return article_object # without changes
time.sleep(sleep_time)
# give the page time to do any funky business
# in the meantime, get a page title if required
if article_object.is_title_bad:
article_object.title = self.driver.title
article_object.title = self.driver.title.replace(".pdf", "") # some titles end with .pdf
# will be propagated to the saved file (dst) as well
fname = article_object.fname_template
dst = os.path.join(article_object.save_path, fname)
if os.path.exists(dst):
fname = make_path_unique(fname)
dst = os.path.join(article_object.save_path, fname)
if url[-4:] == ".pdf":
# according to the browser preferences, calling the url will open pdfjs.
# If not handled separately, printing would require the ctrl+p route, but setup is janky to say the least
success = self.get_exisiting_pdf(url, dst)
else:
success = self.get_new_pdf(dst)
if success:
article_object.file_name = fname
else:
article_object.file_name = ""
return article_object # this change is saved later by the external caller
def get_exisiting_pdf(self, url, dst):
try:
r = requests.get(url)
bytes = r.content
except:
return False
return self.get_new_pdf(dst, other_bytes=bytes)
def get_new_pdf(self, dst, other_bytes=None):
os.makedirs(os.path.dirname(dst), exist_ok=True)
if other_bytes is None:
try:
result = self.driver.print_page()
bytes = base64.b64decode(result, validate=True)
except:
self.logger.error("Failed, probably because the driver went extinct.")
return False
return self.write_pdf(bytes, article_object)
def get_file_destination(self, article_object):
fname = article_object.fname_template
fname = ensure_unique(article_object.save_path, fname)
dst = os.path.join(article_object.save_path, fname)
return dst, fname
def write_pdf(self, content, article_object):
dst, fname = self.get_file_destination(article_object)
os.makedirs(os.path.dirname(dst), exist_ok=True)
else:
bytes = other_bytes
try:
with open(dst, "wb+") as f:
f.write(content)
article_object.file_name = fname
f.write(bytes)
return True
except Exception as e:
self.logger.error(f"Failed, because of FS-operation: {e}")
@@ -147,34 +151,11 @@ class PDFDownloader:
def create_tmp_profile(self, full_profile_path: Path = Path(download_config["browser_profile_path"])) -> Path:
reduced_profile_path = Path(f"/tmp/firefox_profile_{uuid.uuid4()}")
os.mkdir(reduced_profile_path)
# copy needed directories
dirs = ["extensions", "storage"]
for dir in dirs:
shutil.copytree(full_profile_path / dir, reduced_profile_path / dir)
# copy needed files
files = ["extension-preferences.json", "addons.json", "addonStartup.json.lz4", "prefs.js", "extensions.json", "cookies.sqlite"]
for f in files:
shutil.copy(full_profile_path / f, reduced_profile_path)
folder_size = round(sum(p.stat().st_size for p in Path(reduced_profile_path).rglob('*')) / 1024 / 1024, 3)
self.logger.info(f"Generated temporary profile at {reduced_profile_path} with size {folder_size} MB")
return reduced_profile_path
def ensure_unique(path, fname):
fbase, ending = os.path.splitext(fname)
exists = os.path.exists(os.path.join(path, fname))
i = 1
while exists:
fname = fbase + f" -- fetch {i}" + ending
i += 1
exists = os.path.exists(os.path.join(path, fname))
return fname
def make_path_unique(path):
fname, ending = os.path.splitext(path)
fname += datetime.datetime.now().strftime("%d-%H%M%S")
return fname + ending
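
`ensure_unique` appends an incrementing ` -- fetch N` suffix until the candidate name no longer collides with an existing file, so repeated fetches of the same article end up side by side instead of overwriting each other. A self-contained demo:

```python
# Self-contained demo of the ensure_unique helper defined above.
import os
import tempfile

def ensure_unique(path, fname):
    fbase, ending = os.path.splitext(fname)
    exists = os.path.exists(os.path.join(path, fname))
    i = 1
    while exists:
        fname = fbase + f" -- fetch {i}" + ending
        i += 1
        exists = os.path.exists(os.path.join(path, fname))
    return fname

with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "article.pdf"), "w").close()
    print(ensure_unique(d, "article.pdf"))  # article -- fetch 1.pdf
```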

View File

@@ -1,11 +1,11 @@
from __future__ import unicode_literals
import youtube_dl
import os
import logging
import configuration
download_config = configuration.config["downloads"]
logger = logging.getLogger(__name__)
class MyLogger(object):
def debug(self, msg): pass
def warning(self, msg): pass
@@ -20,6 +20,7 @@ class YouTubeDownloader:
def post_download_hook(self, ret_code):
# print(ret_code)
if ret_code['status'] == 'finished':
file_loc = ret_code["filename"]
fname = os.path.basename(file_loc)
@@ -35,11 +36,9 @@ class YouTubeDownloader:
ydl_opts = {
'format': 'best[height<=720]',
'outtmpl': f"{file_path}.%(ext)s", # basically the filename from the object, but with a custom extension depending on the download
'logger': MyLogger(), # suppress verbosity
'logger': MyLogger(),
'progress_hooks': [self.post_download_hook],
'updatetime': False,
# File is also used by firefox so make sure not to write to it!
# youtube-dl apparently does not support cookies.sqlite and the documentation is not clear on how to use cookies.txt
'updatetime': False
}
try:
with youtube_dl.YoutubeDL(ydl_opts) as ydl:
@@ -48,9 +47,5 @@ class YouTubeDownloader:
except Exception as e:
logger.error(f"Youtube download crashed: {e}")
article_object.file_name = ""
logfile = os.path.join(download_config["local_storage_path"], "failed_downloads.csv")
logger.info(f"Logging youtube errors seperately to {logfile}")
with open(logfile, "a+") as f:
f.write(f"{url}\n")
return article_object

View File

@@ -1,3 +1,4 @@
import time
from waybackpy import WaybackMachineSaveAPI # upload to archive.org
import logging
logger = logging.getLogger(__name__)

View File

@@ -3,7 +3,7 @@ from .download.browser import PDFDownloader
from .download.youtube import YouTubeDownloader
from .fetch.runner import get_description
from .upload.runner import upload_to_archive as run_upload
from .compress.runner import shrink_pdf
import time
import logging
@@ -53,3 +53,14 @@ class UploadWorker(TemplateWorker):
super()._handle_article(article_watcher, action)
# article_watcher.upload_completed = True
class CompressWorker(TemplateWorker):
def __init__(self) -> None:
super().__init__()
def _handle_article(self, article_watcher):
action = shrink_pdf
super()._handle_article(article_watcher, action)
# article_watcher.compression_completed = True
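
`CompressWorker` follows the same pattern as the other workers: a thin subclass that picks an `action` and defers the queueing to `TemplateWorker`. The base class is not part of this diff; a hedged sketch of the usual shape of such a threaded-queue worker, with the attribute access on the watcher marked as an assumption:

```python
# Hedged sketch of a TemplateWorker-style base class (the real one is not shown
# in this diff): a daemon thread draining a queue; subclasses supply the action.
import logging
import queue
from threading import Thread

class TemplateWorker(Thread):
    logger = logging.getLogger(__name__)

    def __init__(self) -> None:
        super().__init__(daemon=True)
        self._queue = queue.Queue()

    def process(self, article_watcher):
        """Called by the dispatcher to enqueue work for this worker."""
        self._queue.put(article_watcher)

    def run(self):
        while True:
            article_watcher = self._queue.get()
            self._handle_article(article_watcher)

    def _handle_article(self, article_watcher, action=None):
        if action is None:
            self.logger.error("TemplateWorker subclass did not provide an action")
            return
        # assumption: the watcher exposes the article model it wraps
        action(article_watcher.article)  # e.g. shrink_pdf(article)
```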