Switched to docker compose and wasted hours trying to have standalone firefox

Remy Moll 2022-05-29 18:29:31 +02:00
parent 878a1dff5d
commit 54760abee4
13 changed files with 154 additions and 42 deletions

@@ -1 +1,2 @@
 .dev/
+__pycache__/

@ -1,6 +1,8 @@
FROM python:latest FROM python:latest
ENV TZ Euopre/Zurich ENV TZ Euopre/Zurich
RUN echo "deb http://deb.debian.org/debian/ unstable main contrib non-free" >> /etc/apt/sources.list RUN echo "deb http://deb.debian.org/debian/ unstable main contrib non-free" >> /etc/apt/sources.list
RUN apt-get update && apt-get install -y \ RUN apt-get update && apt-get install -y \
evince \ evince \
@@ -16,7 +18,6 @@ RUN wget https://github.com/mozilla/geckodriver/releases/download/v0.31.0/geckod
 RUN tar -x geckodriver -zf geckodriver-v0.31.0-linux64.tar.gz -O > /usr/bin/geckodriver
 RUN chmod +x /usr/bin/geckodriver
 RUN rm geckodriver-v0.31.0-linux64.tar.gz
-RUN echo "127.0.0.1 localhost" >> /etc/hosts
 RUN useradd --create-home --shell /bin/bash --uid 1001 autonews
@@ -24,15 +25,12 @@ RUN useradd --create-home --shell /bin/bash --uid 1001 autonews
 # home directory needed for pip package installation
 RUN mkdir -p /app/auto_news
 RUN chown -R autonews:autonews /app
 USER autonews
-RUN export PATH=/home/autonews/.local/bin:$PATH
+COPY requirements.txt /app/
+RUN python3 -m pip install -r /app/requirements.txt
 COPY app /app/auto_news
 WORKDIR /app/auto_news
-RUN python3 -m pip install -r requirements.txt
 ENTRYPOINT ["python3", "runner.py"]

@@ -3,7 +3,8 @@
 A utility to fetch article requests from slack and generate pdfs for them, fully automatically.
-## Running
+## Running - Pure docker
+> I recommend running with docker compose instead
 ### How to run - auto archiving mode
 In this mode the program is launched as a docker container, in a headless mode. For persistence purposes a local storage volume is required, but that's it!
@@ -15,6 +16,12 @@ You can specify additional parameters:
 `docker run -it -v <your storage>:/app/file_storage/ auto_news upload` catches up on incomplete uploads to the archive.
+`docker run -it -v <your storage>:/app/file_storage/ auto_news reducedfetch` makes assumptions about the status of the slack chat and greatly reduces the number of api calls (faster start-up).
+These parameters can be combined (mostly for testing, I guess).
+Finally, for manual file verification:
 `docker run -it -v <your storage>:/app/file_storage/ -e DISPLAY=":0" --network host -v $XAUTHORITY:/root/.Xauthority auto_news check` lets you visually verify the downloaded files. The additional parameters are required in order to open GUIs on the host.
@@ -24,33 +31,51 @@ In this mode, a docker container is launched with an additional volume, the loca
 `docker run -it -v <your storage>:/app/file_storage/ -v <your code>:/code/ --entrypoint /bin/bash auto_news`
 You are dropped into a bash shell, in which you can navigate to the `/code` directory and then test live.
-### Cheat-sheet Remy:
-`docker run -it -v /mnt/Data/COSS/Downloads/auto_news.container/:/app/file_storage/ auto_news`
-`docker run -it -v /mnt/Data/COSS/Downloads/auto_news.container/:/app/file_storage/ -v /mnt/Data/COSS/Development/auto_news/app:/code --entrypoint /bin/bash auto_news`
-`docker run -it -v /mnt/Data/COSS/Downloads/auto_news.container/:/app/file_storage/ -e DISPLAY=":0" --network host -v $XAUTHORITY:/root/.Xauthority auto_news check`
+## Running - Docker compose
+I also wrote a rudimentary docker compose file which makes running much simpler. Just run
+`docker compose --env-file <desired mode> up`
+All relevant passthroughs and mounts are specified through the env-file, for which I configured 4 versions: production, debug (development in general), upload and check. These files will have to be adapted to your individual setup but can be reused more easily.
+> Note:
+>
+> The `debug` mode requires additional input. Once `docker compose up` is running, in a new session run `docker compose --env-file env/debug exec auto_news bash`. The live-mounted code is then under `/code`. Note that the `DEBUG=true` environment variable is still set. If you want to test things on production, run `export DEBUG=false`.
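The on/off switches in these env files reach the app as plain strings, not booleans. A minimal sketch of the pattern this commit introduces for reading them (the `flag` helper is illustrative, not part of the codebase — the code inlines the comparison, e.g. `os.getenv("DEBUG", "false") == "true"` in configuration.py):

```python
import os

def flag(name: str) -> bool:
    # Illustrative helper mirroring the check used throughout the code:
    # only the exact lowercase string "true" enables a feature.
    return os.getenv(name, "false") == "true"

os.environ["DEBUG"] = "true"
os.environ["HEADLESS"] = "True"  # capitalized, so it does NOT count as true

assert flag("DEBUG") is True
assert flag("HEADLESS") is False  # comparison is case-sensitive
assert flag("NEVER_SET") is False  # unset falls back to "false"
```

This is why the env files spell the values in lowercase; any other spelling silently disables the feature.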
 ## Building
-### Things to keep in mind
-The software (firefox, selenium, python) changes frequently. For non-breaking changes it is useful to regularly clean build the docker image! This is also crucial to update the code itself.
+> The software (firefox, selenium, python) changes frequently. For non-breaking changes it is useful to regularly clean-build the docker image! This is also crucial to update the code itself.
+In docker, simply run:
 `docker build -t auto_news --no-cache .`
 where the `Dockerfile` has to be in the working directory
+In docker compose, run the usual command, but append
+`docker compose ... up --build`
+## Cheat-sheet Remy:
+`docker run -it -v /mnt/Data/COSS/CONTAINERDATA/:/app/file_storage/ auto_news`
+`docker run -it -v /mnt/Data/COSS/CONTAINERDATA/:/app/file_storage/ -v /mnt/Data/COSS/auto_news/app:/code --entrypoint /bin/bash auto_news`
+`docker run -it -v /mnt/Data/COSS/CONTAINERDATA/:/app/file_storage/ -e DISPLAY=":0" --network host -v $XAUTHORITY:/root/.Xauthority auto_news check`
 ## Roadmap:
 [ ] automatically upload files to NAS
-[ ] handle paywalled sites like faz, spiegel, .. through their dedicated edu-sites
+[ ] handle paywalled sites like faz, spiegel, .. through their dedicated edu-friendly sites
 ...

@@ -1,5 +1,4 @@
 import os
-import sys
 import configparser
 import logging
 from peewee import SqliteDatabase
@@ -19,18 +18,18 @@ logger = logging.getLogger(__name__)
 parsed = configparser.ConfigParser()
 parsed.read("/app/file_storage/config.ini")
-if "debug" in sys.argv:
-    logger.warning("Running in debugging mode because launched with argument 'debug'")
-    # parsed.read("/code/config.ini")
+if os.getenv("DEBUG", "false") == "true":
+    logger.warning("Found 'DEBUG=true', setting up dummy databases")
     db_base_path = parsed["DATABASE"]["db_path_dev"]
     parsed["SLACK"]["archive_id"] = parsed["SLACK"]["debug_id"]
     parsed["MAIL"]["recipient"] = parsed["MAIL"]["sender"]
 else:
-    logger.warning("Using production values, I hope you know what you're doing...")
+    logger.warning("Found 'DEBUG=false' and running on production databases, I hope you know what you're doing...")
     db_base_path = parsed["DATABASE"]["db_path_prod"]

 from utils_storage import models

 # Set up the database

@@ -1,9 +1,9 @@
 """Main coordination of other util classes. Handles inbound and outbound calls"""
 import configuration
 models = configuration.models
-import sys
 from threading import Thread
 import logging
+import os
 logger = logging.getLogger(__name__)

 from utils_mail import runner as mail_runner
@@ -172,12 +172,12 @@ if __name__ == "__main__":
     coordinator = Coordinator()
-    if "upload" in sys.argv:
+    if os.getenv("UPLOAD", "false") == "true":
         articles = models.ArticleDownload.select().where(models.ArticleDownload.archive_url == "").execute()
         logger.info(f"Launching upload to archive for {len(articles)} articles.")
         coordinator.manual_processing(articles, [UploadWorker()])
-    elif "check" in sys.argv:
+    elif os.getenv("CHECK", "false") == "true":
         from utils_check import runner as check_runner
         check_runner.verify_unchecked()

@@ -3,7 +3,6 @@ import configuration
 import requests
 import os
 import time
-import sys
 from threading import Thread
 from slack_sdk.errors import SlackApiError
@@ -30,10 +29,10 @@ def init(client) -> None:
     t = Thread(target = fetch_missed_channel_reactions) # threaded, runs in background (usually takes a long time)
     t.start()
-    if "reducedfetch" in sys.argv:
-        logger.warning("Only fetching empty threads for bot messages because of argument 'reducedfetch'")
+    if os.getenv("REDUCEDFETCH", "false") == "true":
+        logger.warning("Only fetching empty threads for bot messages because 'REDUCEDFETCH=true'")
         fetch_missed_thread_messages(reduced=True)
-    else: # perform these two asynchronously
+    else: # perform both asynchronously
         fetch_missed_thread_messages()

@@ -2,7 +2,6 @@ import time
 import datetime
 import logging
 import os
-import sys
 import base64
 import requests
 from selenium import webdriver
@@ -20,28 +19,34 @@ class PDFDownloader:
     running = False

     def start(self):
-        options = Options()
+        try:
+            self.finish()
+        except:
+            self.logger.info("gecko driver not yet running")
+        options = webdriver.FirefoxOptions()
         options.profile = config["browser_profile_path"]
+        # should be options.set_preference("profile", config["browser_profile_path"]) as of selenium 4 but that doesn't work
-        if "notheadless" in sys.argv:
-            self.logger.warning("Opening browser GUI because of Argument 'notheadless'")
-        else:
+        if os.getenv("HEADLESS", "false") == "true":
             options.add_argument('--headless')
+        else:
+            self.logger.warning("Opening browser GUI because 'HEADLESS=false'")
+
+        # Print to pdf
+        options.set_preference("print_printer", "Mozilla Save to PDF")
+        options.set_preference("print.always_print_silent", True)
+        options.set_preference("print.show_print_progress", False)
         options.set_preference('print.save_as_pdf.links.enabled', True)
         # Just save if the filetype is pdf already, does not work!
         options.set_preference("print.printer_Mozilla_Save_to_PDF.print_to_file", True)
         options.set_preference("browser.download.folderList", 2)
         # options.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf")
         # options.set_preference("pdfjs.disabled", True)
         options.set_preference("browser.download.dir", config["default_download_path"])
-        self.logger.info("Now Starting gecko driver")
-        self.driver = webdriver.Firefox(options=options)
+        self.logger.info("Starting gecko driver")
+        self.driver = webdriver.Firefox(
+            options = options,
+            service = webdriver.firefox.service.Service(
+                log_path = f'{config["local_storage_path"]}/geckodriver.log'
+            ))

         residues = os.listdir(config["default_download_path"])
         for res in residues:
@@ -54,6 +59,7 @@ class PDFDownloader:
         self.start() # relaunch the dl util

     def finish(self):
+        self.logger.info("Exiting gecko driver")
         self.driver.quit()
         self.running = False
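The new try/finish preamble makes `start()` safe to call repeatedly: any already-running driver is torn down before a new one is launched. A standalone sketch of that pattern (the `DummyDriver` stand-in is mine, not from the codebase):

```python
import logging

class DummyDriver:
    """Stand-in for the selenium Firefox driver."""
    def __init__(self):
        self.alive = True
    def quit(self):
        self.alive = False

class Downloader:
    def __init__(self):
        self.logger = logging.getLogger("downloader")
        self.driver = None
        self.running = False

    def start(self):
        # Tear down any previous driver first, so repeated start() calls
        # never leak a running geckodriver process.
        try:
            self.finish()
        except Exception:
            self.logger.info("driver not yet running")
        self.driver = DummyDriver()
        self.running = True

    def finish(self):
        # Raises AttributeError on the very first call (driver is None),
        # which start() swallows on purpose.
        self.driver.quit()
        self.running = False

d = Downloader()
d.start()          # first call: finish() fails harmlessly
old = d.driver
d.start()          # second call: the old driver is quit before relaunch
assert old.alive is False
assert d.running is True
```

This is what lets the error path elsewhere in the class simply call `self.start()` to relaunch the download util without checking whether a driver is already up.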

docker-compose.yaml (new file)

@@ -0,0 +1,36 @@
# docker compose --env-file env/debug up
version: "3.9"

services:
  auto_news:
    build: .
    volumes:
      - ${CONTAINER_DATA}:/app/file_storage
      - ${HOSTS_FILE}:/etc/hosts
      - ${CODE:-/dev/null}:/code # not set in prod, defaults to /dev/null
      - ${XAUTHORITY-/dev/null}:/home/auto_news/.Xauthority
    network_mode: host
    environment:
      - DISPLAY=$DISPLAY
      - DEBUG=${DEBUG}
      - CHECK=${CHECK}
      - UPLOAD=${UPLOAD}
      - HEADLESS=${HEADLESS}
      - REDUCEDFETCH=${REDUCEDFETCH}
    entrypoint: ${ENTRYPOINT:-"python3 runner.py"} # by default launch workers as defined in the Dockerfile

  # geckodriver:
  #   image: selenium/standalone-firefox:100.0
  #   volumes:
  #     - ${CONTAINER_DATA-/dev/null}:/app/file_storage
  #     - ${FIREFOX_PROFILE}:/auto_news.profile
  #     - ${HOSTS_FILE}:/etc/hosts
  #   environment:
  #     - DISPLAY=$DISPLAY
  #     - START_XVFB=false
  #   network_mode: host
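The compose file mixes two default forms: `${CODE:-/dev/null}` falls back when the variable is unset *or* empty, while `${XAUTHORITY-/dev/null}` (no colon) falls back only when it is entirely unset. Compose follows shell parameter-expansion rules here; a small Python sketch of the difference (function names are mine, for illustration only):

```python
def colon_dash(env: dict, name: str, default: str) -> str:
    # ${VAR:-default}: fall back when VAR is unset OR empty
    value = env.get(name, "")
    return value if value != "" else default

def dash(env: dict, name: str, default: str) -> str:
    # ${VAR-default}: fall back only when VAR is unset
    return env[name] if name in env else default

env = {"CODE": "", "XAUTHORITY": ""}
assert colon_dash(env, "CODE", "/dev/null") == "/dev/null"  # empty -> default
assert dash(env, "XAUTHORITY", "/dev/null") == ""           # set-but-empty is kept
assert dash(env, "MISSING", "/dev/null") == "/dev/null"     # unset -> default
```

In practice this means an empty `CODE=` in an env file still mounts `/dev/null`, while an empty `XAUTHORITY=` is passed through as an empty path.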

env/check (new file)

@@ -0,0 +1,12 @@
# Does not run any downloads but displays the previously downloaded but not yet checked files. Requires display access via xauth
CONTAINER_DATA=/mnt/Data/COSS/Downloads/auto_news.container
HOSTS_FILE=/mnt/Data/COSS/Downloads/auto_news.container/dependencies/hosts
XAUTHORITY=$XAUTHORITY
DEBUG=false
CHECK=true
HEADLESS=true
UPLOAD=false
REDUCEDFETCH=false

env/debug (new file)

@@ -0,0 +1,15 @@
# Runs in a debugging mode: does not launch the workers but keeps the container idle so you can exec a bash into it
CONTAINER_DATA=/mnt/Data/COSS/Downloads/auto_news.container
HOSTS_FILE=/mnt/Data/COSS/Downloads/auto_news.container/dependencies/hosts
CODE=./
XAUTHORITY=$XAUTHORITY
DEBUG=true
CHECK=false
UPLOAD=false
HEADLESS=false
REDUCEDFETCH=false
ENTRYPOINT="sleep infinity"

env/production (new file)

@@ -0,0 +1,10 @@
# Runs on the main slack channel with the full worker setup. If nothing funky has occurred, reducedfetch is a speedup
CONTAINER_DATA=/mnt/Data/Downloads/auto_news.container
HOSTS_FILE=/mnt/Data/COSS/Downloads/auto_news.container/dependencies/hosts
DEBUG=false
CHECK=false
UPLOAD=false
HEADLESS=true
REDUCEDFETCH=true

env/upload (new file)

@@ -0,0 +1,11 @@
# Does not run any other workers and only uploads to the archive the urls that weren't previously uploaded
CONTAINER_DATA=/mnt/Data/COSS/Downloads/auto_news.container
HOSTS_FILE=/mnt/Data/COSS/Downloads/auto_news.container/dependencies/hosts
DEBUG=false
CHECK=false
UPLOAD=true
HEADLESS=true
REDUCEDFETCH=false