Switched to docker compose and wasted hours trying to have standalone firefox

Remy Moll 2022-05-29 18:29:31 +02:00
parent 878a1dff5d
commit 54760abee4
13 changed files with 154 additions and 42 deletions

@ -1 +1,2 @@
.dev/
__pycache__/

@ -1,6 +1,8 @@
FROM python:latest
ENV TZ Europe/Zurich
RUN echo "deb http://deb.debian.org/debian/ unstable main contrib non-free" >> /etc/apt/sources.list
RUN apt-get update && apt-get install -y \
evince \
@ -16,7 +18,6 @@ RUN wget https://github.com/mozilla/geckodriver/releases/download/v0.31.0/geckod
RUN tar -xzf geckodriver-v0.31.0-linux64.tar.gz -O geckodriver > /usr/bin/geckodriver
RUN chmod +x /usr/bin/geckodriver
RUN rm geckodriver-v0.31.0-linux64.tar.gz
RUN echo "127.0.0.1 localhost" >> /etc/hosts
RUN useradd --create-home --shell /bin/bash --uid 1001 autonews
@ -24,15 +25,12 @@ RUN useradd --create-home --shell /bin/bash --uid 1001 autonews
# home directory needed for pip package installation
RUN mkdir -p /app/auto_news
RUN chown -R autonews:autonews /app
USER autonews
ENV PATH /home/autonews/.local/bin:$PATH
COPY requirements.txt /app/
RUN python3 -m pip install -r /app/requirements.txt
COPY app /app/auto_news
WORKDIR /app/auto_news
RUN python3 -m pip install -r requirements.txt
ENTRYPOINT ["python3", "runner.py"]

@ -3,7 +3,8 @@
A utility to fetch article requests from slack and generate pdfs for them, fully automatically.
## Running
## Running - Pure docker
> I recommend running with docker compose instead
### How to run - auto archiving mode
In this mode the program is launched as a docker container in headless mode. For persistence a local storage volume is required, but that's it!
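For reference, the basic invocation looks like this (a sketch, assuming the image is tagged `auto_news` and `<your storage>` points to the persistent volume):
`docker run -it -v <your storage>:/app/file_storage/ auto_news`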
@ -15,6 +16,12 @@ You can specify additional parameters:
`docker run -it -v <your storage>:/app/file_storage/ auto_news upload` catches up on incomplete uploads to the archive.
`docker run -it -v <your storage>:/app/file_storage/ auto_news reducedfetch` makes assumptions about the status of the slack chat and greatly reduces the number of API calls (faster start-up).
These parameters can be combined (mostly useful for testing).
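The same toggles can also be passed as environment variables, which is what the docker compose setup below relies on. A sketch (the variable names `UPLOAD` and `REDUCEDFETCH` are taken from the compose file):
`docker run -it -v <your storage>:/app/file_storage/ -e UPLOAD=true -e REDUCEDFETCH=true auto_news`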
Finally, for manual file verification:
`docker run -it -v <your storage>:/app/file_storage/ -e DISPLAY=":0" --network host -v $XAUTHORITY:/root/.Xauthority auto_news check` lets you visually verify the downloaded files. The additional parameters are required in order to open GUIs on the host.
@ -24,33 +31,51 @@ In this mode, a docker container is launched with an additional volume, the loca
`docker run -it -v <your storage>:/app/file_storage/ -v <your code>:/code/ --entrypoint /bin/bash auto_news`
You are dropped into a bash shell, in which you can navigate to the `/code` directory and test live.
### Cheat-sheet Remy:
`docker run -it -v /mnt/Data/COSS/Downloads/auto_news.container/:/app/file_storage/ auto_news`
`docker run -it -v /mnt/Data/COSS/Downloads/auto_news.container/:/app/file_storage/ -v /mnt/Data/COSS/Development/auto_news/app:/code --entrypoint /bin/bash auto_news`
`docker run -it -v /mnt/Data/COSS/Downloads/auto_news.container/:/app/file_storage/ -e DISPLAY=":0" --network host -v XAUTHORITY:/root/.Xauthority auto_news check`
## Running - Docker compose
I also wrote a rudimentary docker compose file which makes running much simpler. Just run
`docker compose --env-file <desired mode> up`
All relevant passthroughs and mounts are specified through the env file, of which I configured 4 versions: production, debug (development in general), upload and check. These files will have to be adapted to your individual setup but can then be reused easily.
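For example, from the repository root (the env files live under `env/`):
`docker compose --env-file env/production up`
`docker compose --env-file env/check up`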
> Note:
>
> The `debug` mode requires an additional step. Once `docker compose up` is running, open a new session and run `docker compose --env-file env/debug exec auto_news bash` (`auto_news` being the service name from the compose file). The live-mounted code is then under `/code`. Note that the `DEBUG=true` environment variable is still set. If you want to test things on production, run `export DEBUG=false`.
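Inside that shell the live-mounted code can be run directly, for example (a sketch; the exact path depends on what `CODE` points to in `env/debug`):
`cd /code && python3 app/runner.py`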
## Building
### Things to keep in mind
The software (firefox, selenium, python) changes frequently. For non-breaking changes it is useful to regularly clean build the docker image! This is also crucial to update the code itself.
> The software (firefox, selenium, python) changes frequently. For non-breaking changes it is useful to regularly do a clean build of the docker image! This is also necessary for changes to the code itself to take effect.
In docker, simply run:
`docker build -t auto_news --no-cache .`
where the `Dockerfile` has to be in the working directory
In docker compose, run the usual command, but append
`docker compose ... up --build`
## Cheat-sheet Remy:
`docker run -it -v /mnt/Data/COSS/CONTAINERDATA/:/app/file_storage/ auto_news`
`docker run -it -v /mnt/Data/COSS/CONTAINERDATA/:/app/file_storage/ -v /mnt/Data/COSS/auto_news/app:/code --entrypoint /bin/bash auto_news`
`docker run -it -v /mnt/Data/COSS/CONTAINERDATA/:/app/file_storage/ -e DISPLAY=":0" --network host -v $XAUTHORITY:/root/.Xauthority auto_news check`
## Roadmap:
[ ] automatically upload files to NAS
[ ] handle paywalled sites like faz, spiegel, .. through their dedicated edu-sites
[ ] handle paywalled sites like faz, spiegel, .. through their dedicated edu-friendly sites
...

@ -1,5 +1,4 @@
import os
import sys
import configparser
import logging
from peewee import SqliteDatabase
@ -19,18 +18,18 @@ logger = logging.getLogger(__name__)
parsed = configparser.ConfigParser()
parsed.read("/app/file_storage/config.ini")
if "debug" in sys.argv:
logger.warning("Running in debugging mode because launched with argument 'debug'")
# parsed.read("/code/config.ini")
if os.getenv("DEBUG", "false") == "true":
logger.warning("Found 'DEBUG=true', setting up dummy databases")
db_base_path = parsed["DATABASE"]["db_path_dev"]
parsed["SLACK"]["archive_id"] = parsed["SLACK"]["debug_id"]
parsed["MAIL"]["recipient"] = parsed["MAIL"]["sender"]
else:
logger.warning("Using production values, I hope you know what you're doing...")
logger.warning("Found 'DEBUG=false' and running on production databases, I hope you know what you're doing...")
db_base_path = parsed["DATABASE"]["db_path_prod"]
from utils_storage import models
# Set up the database

@ -1,9 +1,9 @@
"""Main coordination of other util classes. Handles inbound and outbound calls"""
import configuration
models = configuration.models
import sys
from threading import Thread
import logging
import os
logger = logging.getLogger(__name__)
from utils_mail import runner as mail_runner
@ -172,12 +172,12 @@ if __name__ == "__main__":
coordinator = Coordinator()
if "upload" in sys.argv:
if os.getenv("UPLOAD", "false") == "true":
articles = models.ArticleDownload.select().where(models.ArticleDownload.archive_url == "").execute()
logger.info(f"Launching upload to archive for {len(articles)} articles.")
coordinator.manual_processing(articles, [UploadWorker()])
elif "check" in sys.argv:
elif os.getenv("CHECK", "false") == "true":
from utils_check import runner as check_runner
check_runner.verify_unchecked()

@ -3,7 +3,6 @@ import configuration
import requests
import os
import time
import sys
from threading import Thread
from slack_sdk.errors import SlackApiError
@ -30,10 +29,10 @@ def init(client) -> None:
t = Thread(target = fetch_missed_channel_reactions) # threaded, runs in background (usually takes a long time)
t.start()
if "reducedfetch" in sys.argv:
logger.warning("Only fetching empty threads for bot messages because of argument 'reducedfetch'")
if os.getenv("REDUCEDFETCH", "false") == "true":
logger.warning("Only fetching empty threads for bot messages because 'REDUCEDFETCH=true'")
fetch_missed_thread_messages(reduced=True)
else: # perform these two asyncronously
else: # perform both asynchronously
fetch_missed_thread_messages()

@ -2,7 +2,6 @@ import time
import datetime
import logging
import os
import sys
import base64
import requests
from selenium import webdriver
@ -20,28 +19,34 @@ class PDFDownloader:
running = False
def start(self):
options=Options()
try:
self.finish()
except:
self.logger.info("gecko driver not yet running")
options = webdriver.FirefoxOptions()
options.profile = config["browser_profile_path"]
if "notheadless" in sys.argv:
self.logger.warning("Opening browser GUI because of Argument 'notheadless'")
else:
# should be options.set_preference("profile", config["browser_profile_path"]) as of selenium 4 but that doesn't work
if os.getenv("HEADLESS", "false") == "true":
options.add_argument('--headless')
else:
self.logger.warning("Opening browser GUI because of 'HEADLESS=true'")
# Print to pdf
options.set_preference("print_printer", "Mozilla Save to PDF")
options.set_preference("print.always_print_silent", True)
options.set_preference("print.show_print_progress", False)
options.set_preference('print.save_as_pdf.links.enabled', True)
# Just save if the filetype is pdf already, does not work!
options.set_preference("print.printer_Mozilla_Save_to_PDF.print_to_file", True)
options.set_preference("browser.download.folderList", 2)
# options.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf")
# options.set_preference("pdfjs.disabled", True)
options.set_preference("browser.download.dir", config["default_download_path"])
self.logger.info("Now Starting gecko driver")
self.driver = webdriver.Firefox(options=options)
self.logger.info("Starting gecko driver")
self.driver = webdriver.Firefox(
options = options,
service = webdriver.firefox.service.Service(
log_path = f'{config["local_storage_path"]}/geckodriver.log'
))
residues = os.listdir(config["default_download_path"])
for res in residues:
@ -54,6 +59,7 @@ class PDFDownloader:
self.start() # relaunch the dl util
def finish(self):
self.logger.info("Exiting gecko driver")
self.driver.quit()
self.running = False

docker-compose.yaml Normal file
@ -0,0 +1,36 @@
# docker compose --env-file env/debug up
version: "3.9"
services:
auto_news:
build: .
volumes:
- ${CONTAINER_DATA}:/app/file_storage
- ${HOSTS_FILE}:/etc/hosts
- ${CODE:-/dev/null}:/code # not set in prod, defaults to /dev/null
- ${XAUTHORITY-/dev/null}:/home/autonews/.Xauthority
network_mode: host
environment:
- DISPLAY=$DISPLAY
- DEBUG=${DEBUG}
- CHECK=${CHECK}
- UPLOAD=${UPLOAD}
- HEADLESS=${HEADLESS}
- REDUCEDFETCH=${REDUCEDFETCH}
entrypoint: ${ENTRYPOINT:-"python3 runner.py"} # by default launch workers as defined in the Dockerfile
# geckodriver:
# image: selenium/standalone-firefox:100.0
# volumes:
#
# - ${CONTAINER_DATA-/dev/null}:/app/file_storage
# - ${FIREFOX_PROFILE}:/auto_news.profile
# - ${HOSTS_FILE}:/etc/hosts
# environment:
# - DISPLAY=$DISPLAY
# - START_XVFB=false
# network_mode: host
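# # (sketch) if this standalone-firefox service were re-enabled, the python downloader
# # would have to attach to it remotely (e.g. selenium's webdriver.Remote pointed at
# # http://localhost:4444) instead of spawning its own local geckodriver.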

env/check vendored Normal file
@ -0,0 +1,12 @@
# Does not run any downloads but displays the previously downloaded but not yet checked files. Requires display access via xauth
CONTAINER_DATA=/mnt/Data/COSS/Downloads/auto_news.container
HOSTS_FILE=/mnt/Data/COSS/Downloads/auto_news.container/dependencies/hosts
XAUTHORITY=$XAUTHORITY
DEBUG=false
CHECK=true
HEADLESS=true
UPLOAD=false
REDUCEDFETCH=false

env/debug vendored Normal file
@ -0,0 +1,15 @@
# Runs in debugging mode: launches no workers, just keeps the container alive so a bash shell can be attached
CONTAINER_DATA=/mnt/Data/COSS/Downloads/auto_news.container
HOSTS_FILE=/mnt/Data/COSS/Downloads/auto_news.container/dependencies/hosts
CODE=./
XAUTHORITY=$XAUTHORITY
DEBUG=true
CHECK=false
UPLOAD=false
HEADLESS=false
REDUCEDFETCH=false
ENTRYPOINT="sleep infinity"

env/production vendored Normal file
@ -0,0 +1,10 @@
# Runs on the main slack channel with the full worker setup. If nothing funky has occurred, reducedfetch is a speedup
CONTAINER_DATA=/mnt/Data/Downloads/auto_news.container
HOSTS_FILE=/mnt/Data/COSS/Downloads/auto_news.container/dependencies/hosts
DEBUG=false
CHECK=false
UPLOAD=false
HEADLESS=true
REDUCEDFETCH=true

env/upload vendored Normal file
@ -0,0 +1,11 @@
# Does not run any other workers and only uploads to the archive the URLs that weren't previously uploaded
CONTAINER_DATA=/mnt/Data/COSS/Downloads/auto_news.container
HOSTS_FILE=/mnt/Data/COSS/Downloads/auto_news.container/dependencies/hosts
DEBUG=false
CHECK=false
UPLOAD=true
HEADLESS=true
REDUCEDFETCH=false