Compare commits
4 Commits
4e2245044f
...
afead44d6c
| SHA1 |
| --- |
| afead44d6c |
| 713406dc67 |
| 2e65828bbb |
| 60c9e88c7b |
30 .gitignore (vendored)
@ -1,5 +1,33 @@
|
||||
.dev/
|
||||
|
||||
.vscode/
|
||||
*.pyc
|
||||
*.log
|
||||
__pycache__/
|
||||
|
||||
|
||||
|
||||
## svelte:
|
||||
# Logs
|
||||
logs
|
||||
*.log
|
||||
npm-debug.log*
|
||||
yarn-debug.log*
|
||||
yarn-error.log*
|
||||
pnpm-debug.log*
|
||||
lerna-debug.log*
|
||||
|
||||
node_modules
|
||||
dist
|
||||
dist-ssr
|
||||
*.local
|
||||
|
||||
# Editor directories and files
|
||||
.vscode/*
|
||||
!.vscode/extensions.json
|
||||
.idea
|
||||
.DS_Store
|
||||
*.suo
|
||||
*.ntvs*
|
||||
*.njsproj
|
||||
*.sln
|
||||
*.sw?
|
||||
|
87 README.md
@ -11,18 +11,15 @@ A utility to
|
||||
... fully automatically. Run it now, thank me later.
|
||||
|
||||
---
|
||||
## Running - Docker compose
|
||||
|
||||
The included `docker-compose` file is now necessary for easy orchestration of the various services.
|
||||
## Running - through launch file
|
||||
> Prerequisite: make `launch` executable:
|
||||
>
|
||||
> `chmod +x launch`
|
||||
|
||||
All relevant passthroughs and mounts are specified through the env-file, for which I configured 4 versions:
|
||||
Execute the file by running `./launch`. This won't do anything by itself: you need to specify a mode and then a command.
|
||||
|
||||
* production
|
||||
* debug (development in general)
|
||||
* upload
|
||||
* check
|
||||
|
||||
These files will have to be adapted to your individual setup but won't change significantly once set up.
|
||||
`./launch <mode> <command> <command options>`
|
||||
|
||||
### Overview of the modes
|
||||
|
||||
@ -30,47 +27,67 @@ The production mode performs all automatic actions and therefore does not require
|
||||
|
||||
The debug mode is more sophisticated and allows for big code changes without the need to recompile. It directly mounts the code-directory into the container. As a failsafe the environment-variable `DEBUG=true` is set. The whole utility is then run in a sandbox environment (slack-channel, database, email) so that Dirk is not affected by any mishaps.
|
||||
|
||||
The check mode is less sophisticated but shows the downloaded articles to the host for visual verification. This requires passthroughs for X11.
|
||||
|
||||
Upload mode is much simpler: it goes over the existing database and operates on the articles where the upload to archive.org has not yet occurred (archive.org is slow, so the other operations usually finish before the queue is consumed). It retries their upload.
|
||||
|
||||
* For normal `production` mode run:
|
||||
|
||||
`docker compose --env-file env/production run news_fetch`
|
||||
Two additional 'modes' are `build` and `down`. Build rebuilds the containers, which is necessary after code changes. Down ensures a clean shutdown of *all* containers. Usually the launch-script handles this already, but it sometimes fails, in which case `down` needs to be called explicitly.
|
||||
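For reference, the two extra modes are invoked just like the others (both map to the corresponding `docker compose` calls in the launch script shown later in this diff):

```bash
./launch build   # rebuilds the docker images (runs `docker compose build`)
./launch down    # stops all containers belonging to this compose project
```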
|
||||
|
||||
* For `debug` mode run:
|
||||
### Overview of the commands
|
||||
|
||||
`docker compose --env-file env/debug run news_fetch`
|
||||
|
||||
which drops you into an interactive shell (`ctrl+d` to exit the container shell).
|
||||
In essence a command is simply a service from docker-compose, which is run in an interactive environment. As such all services defined in `docker-compose.yaml` can be called as commands. Only two of them will be of real use:
|
||||
|
||||
> Note:
|
||||
> The live-mounted code is now under `/code`. Note that the `DEBUG=true` environment variable is still set. If you want to test things on production, run `export DEBUG=false`. Running `python runner.py` will now run the newly written code but with the production database and storage.
|
||||
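A minimal sketch of that workflow inside the debug shell; the exact location of `runner.py` under the `/code` mount is an assumption here:

```bash
# inside the shell opened by `./launch debug news_fetch`
cd /code/news_fetch    # assumed path of the live-mounted runner.py
python runner.py       # runs the edited code against the sandbox services (DEBUG=true is still set)
export DEBUG=false     # only if you really want the production database and storage
python runner.py
```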
`news_fetch` does the majority of the actions mentioned above. By default, that is without any options, it runs a metadata-fetch, download, compression, and upload to archive.org. The upload is usually the slowest step, which is why articles that are processed but don't yet have an archive.org url tend to pile up. You can therefore specify the option `upload`, which only starts the upload for the concerned articles, as a catch-up if you will.
|
||||
|
||||
* For `check` mode, some env-variables are also changed and you still require interactivity. You don't need the geckodriver service however. The simplest way is to run
|
||||
Example usage:
|
||||
|
||||
`docker compose --env-file env/check run --no-deps --rm news_fetch`
|
||||
```bash
|
||||
./launch production news_fetch # full mode
|
||||
./launch production news_fetch upload # upload mode (lighter resource usage)
|
||||
./launch debug news_fetch # debug mode, which drops you inside a new shell
|
||||
|
||||
* Finally, for `upload` mode no interactivity is required and no additional services are required. Simply run:
|
||||
|
||||
`docker compose --env-file env/upload run --no-deps --rm news_fetch`
|
||||
./launch production news_check
|
||||
```
|
||||
|
||||
### Stopping
|
||||
Run
|
||||
`news_check` starts a webapp, accessible under [http://localhost:8080](http://localhost:8080) and allows you to easily check the downloaded articles.
|
||||
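Besides the web UI, the small Flask API behind it (see `news_check/server/app.py` later in this diff) can be queried directly; article id 42 below is an arbitrary example:

```bash
curl http://localhost:8080/api/article/first      # id of the first not-yet-verified article
curl http://localhost:8080/api/article/42/get     # full metadata of article 42
curl -X POST -H "Content-Type: application/json" \
     -d '{"action": "a"}' \
     http://localhost:8080/api/article/42/set     # mark article 42 as good ("b" = bad)
```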
|
||||
`docker compose --env-file env/production down`
|
||||
|
||||
which terminates all containers associated with the `docker-compose.yaml`.
|
||||
## (Running - Docker compose)
|
||||
> I strongly recommend sticking to the usage of `./launch`.
|
||||
|
||||
Instead of using the launch file you can manually issue `docker compose` commands. Example: checking the logs.
|
||||
|
||||
All relevant mounts and env-variables are most easily specified through the env-file, for which I configured 2 versions:
|
||||
|
||||
* production
|
||||
* debug (development in general)
|
||||
|
||||
These files will have to be adapted to your individual setup but won't change significantly once set up.
|
||||
|
||||
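For orientation, a minimal sketch of such an env-file, using the variable names from `env/production` and `env/debug` shown later in this diff (values are placeholders and must be adapted to your setup):

```bash
# env/production (sketch)
export CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving   # data directory mounted into the containers
export UNAME=remy                                            # recorded as archived_by in the database
export GECKODRIVER_IMG=selenium/standalone-firefox:104.0     # must be >= the host firefox version
export DEBUG=false   # env/debug additionally sets DEBUG=true, HEADFULL=true, CODE=./ and ENTRYPOINT=/bin/bash
```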
Example usage:
|
||||
|
||||
```bash
|
||||
docker compose --env-file env/production run news_fetch # full mode
|
||||
docker compose --env-file env/production run news_fetch upload # upload mode (lighter resource usage)
|
||||
docker compose --env-file env/debug run news_fetch # debug mode, which drops you inside a new shell
|
||||
|
||||
docker compose --env-file env/production run news_check
|
||||
|
||||
# Misc:
|
||||
docker compose --env-file env/production up # starts all services and shows their combined logs
|
||||
docker compose --env-file env/production logs -f news_fetch # follows along with the logs of only one service
|
||||
docker compose --env-file env/production down
|
||||
```
|
||||
|
||||
|
||||
## Building
|
||||
|
||||
> The software (firefox, selenium, python) changes frequently. For non-breaking changes it is useful to regularly clean build the docker image! This is also crucial to update the code itself.
|
||||
> The software (firefox, selenium, python) changes frequently. For non-breaking changes it is useful to regularly rebuild the docker image! This is also crucial to update the code itself.
|
||||
|
||||
In docker compose, run
|
||||
|
||||
`docker compose --env-file env/production build`
|
||||
|
||||
Or simpler, just run
|
||||
|
||||
`./launch build`
|
||||
|
||||
|
||||
|
||||
@ -80,6 +97,10 @@ In docker compose, run
|
||||
|
||||
|
||||
## Manual Sync to NAS:
|
||||
Manual sync is sadly still necessary, as the lsyncd client sometimes gets overwhelmed by quick writes.
|
||||
|
||||
I use `rsync`. Mounting the NAS locally, I navigate to the location of the local folder (notice the trailing slash). Then run
|
||||
`rsync -Razq --no-perms --no-owner --no-group --temp-dir=/tmp --progress --log-file=rsync.log <local folder>/ "<remote>"`
|
||||
where `<remote>` is the location where the NAS is mounted. (Options: `R` - relative paths, `a` - archive mode (multiple actions), `z` - compress during transfer, `q` - quiet. We also don't copy most of the metadata and we keep a log of the transfers.)
|
||||
where `<remote>` is the location where the NAS is mounted. (Options: `R` - relative paths, `a` - archive mode (multiple actions), `z` - compress during transfer, `q` - quiet. We also don't copy most of the metadata and we keep a log of the transfers.)
|
||||
|
||||
You can also use your OS' native copy option and select *do not overwrite*. This should only copy the missing files, significantly speeding up the operation.
|
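A concrete rsync invocation might look like this (both paths below are placeholders for illustration):

```bash
cd ~/Bulk/COSS/Downloads/            # directory that contains the local folder
rsync -Razq --no-perms --no-owner --no-group --temp-dir=/tmp --progress \
      --log-file=rsync.log coss_archiving/ "/mnt/nas22/coss_archiving"
```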
@ -1,25 +1,8 @@
|
||||
# Usage:
|
||||
# docker compose --env-file env/<mode> run <args> news_fetch && docker-compose --env-file env/production down
|
||||
|
||||
version: "3.9"
|
||||
|
||||
services:
|
||||
|
||||
geckodriver:
|
||||
image: selenium/standalone-firefox:103.0
|
||||
volumes:
|
||||
- ${XSOCK-/dev/null}:${XSOCK-/tmp/sock}
|
||||
- ${XAUTHORITY-/dev/null}:/home/auto_news/.Xauthority
|
||||
environment:
|
||||
- DISPLAY=$DISPLAY
|
||||
- START_VNC=false
|
||||
- START_XVFB=false
|
||||
user: 1001:1001
|
||||
expose: # exposed to other docker-compose services only
|
||||
- "4444"
|
||||
|
||||
|
||||
vpn:
|
||||
vpn: # Creates a connection behind the ETH Firewall to access NAS and Postgres
|
||||
image: wazum/openconnect-proxy:latest
|
||||
env_file:
|
||||
- ${CONTAINER_DATA}/config/vpn.config
|
||||
@ -28,13 +11,14 @@ services:
|
||||
volumes:
|
||||
- /dev/net/tun:/dev/net/tun
|
||||
# alternative to cap_add & volumes: specify privileged: true
|
||||
expose: ["5432"] # exposed here because db_passhtrough uses this network. See below for more details
|
||||
|
||||
|
||||
|
||||
nas_sync:
|
||||
nas_sync: # Syncs locally downloaded files with the NAS-share on nas22.ethz.ch/...
|
||||
depends_on:
|
||||
- vpn # used to establish a connection to the SMB server
|
||||
network_mode: "service:vpn"
|
||||
build: nas_sync
|
||||
- vpn
|
||||
network_mode: "service:vpn" # used to establish a connection to the SMB server from inside ETH network
|
||||
build: nas_sync # local folder to build
|
||||
image: nas_sync:latest
|
||||
cap_add: # capabilities needed for mounting the SMB share
|
||||
- SYS_ADMIN
|
||||
@ -49,27 +33,57 @@ services:
|
||||
- /sync/nas_sync.config
|
||||
|
||||
|
||||
news_fetch:
|
||||
geckodriver: # separate docker container for pdf-download. This hugely improves stability (and creates shorter build times for the containers)
|
||||
image: ${GECKODRIVER_IMG}
|
||||
environment:
|
||||
- START_VNC=${HEADFULL-false} # as opposed to headless, used when requiring supervision (eg. for websites that crash)
|
||||
- START_XVFB=${HEADFULL-false}
|
||||
- SE_VNC_NO_PASSWORD=1
|
||||
expose: ["4444"] # exposed to other docker-compose services only
|
||||
ports:
|
||||
- 7900:7900 # port for webvnc
|
||||
|
||||
|
||||
db_passthrough: # Allows a container on the local network to connect to a service (here postgres) through the vpn
|
||||
network_mode: "service:vpn"
|
||||
image: alpine/socat:latest
|
||||
command: ["tcp-listen:5432,reuseaddr,fork", "tcp-connect:id-hdb-psgr-cp48.ethz.ch:5432"]
|
||||
# expose: ["5432"] We would want this passthrough to expose its ports to the other containers
|
||||
# BUT since it uses the same network as the vpn-service, it can't expose ports on its own. 5432 is therefore exposed under service.vpn.expose
|
||||
|
||||
|
||||
news_fetch: # Orchestration of the automatic download. It generates pdfs (via the geckodriver container), fetches descriptions, triggers a snapshot (on archive.org) and writes to a db
|
||||
build: news_fetch
|
||||
image: news_fetch:latest
|
||||
|
||||
depends_on: # when using docker compose run news_fetch, the dependencies are started as well
|
||||
- nas_sync
|
||||
- geckodriver
|
||||
- db_passthrough
|
||||
|
||||
volumes:
|
||||
- ${CONTAINER_DATA}:/app/containerdata # always set
|
||||
- ${CODE:-/dev/null}:/code # not set in prod, defaults to /dev/null
|
||||
- ${XSOCK-/dev/null}:${XSOCK-/tmp/sock} # x11 socket, needed for gui
|
||||
# - ${XAUTHORITY-/dev/null}:/home/auto_news/.Xauthority # xauth needed for authenticating to x11
|
||||
environment:
|
||||
- DISPLAY=$DISPLAY # needed to let x11 apps know where to connect to
|
||||
|
||||
- DEBUG=${DEBUG}
|
||||
- CHECK=${CHECK}
|
||||
- UPLOAD=${UPLOAD}
|
||||
- HEADLESS=${HEADLESS}
|
||||
- REDUCEDFETCH=${REDUCEDFETCH}
|
||||
entrypoint: ${ENTRYPOINT:-python3 runner.py} # by default launch workers as defined in the Dockerfile
|
||||
stdin_open: ${INTERACTIVE:-false} # docker run -i
|
||||
tty: ${INTERACTIVE:-false} # docker run -t
|
||||
- UNAME=${UNAME}
|
||||
entrypoint: ${ENTRYPOINT:-python runner.py} # by default launch workers as defined in the Dockerfile
|
||||
# stdin_open: ${INTERACTIVE:-false} # docker run -i
|
||||
# tty: ${INTERACTIVE:-false} # docker run -t
|
||||
|
||||
|
||||
news_check: # Creates a small webapp on http://localhost:8080 to check previously generated pdfs (some of which are unusable and must be marked as such)
|
||||
build: news_check
|
||||
image: news_check:latest
|
||||
# user: 1001:1001 # since the app writes files to the local filesystem, it must be run as the current user
|
||||
depends_on:
|
||||
- db_passthrough
|
||||
volumes:
|
||||
- ${CONTAINER_DATA}:/app/containerdata # always set
|
||||
- ${CODE:-/dev/null}:/code # not set in prod, defaults to /dev/null
|
||||
environment:
|
||||
- UNAME=${UNAME}
|
||||
ports:
|
||||
- "8080:80" # 80 inside container
|
||||
entrypoint: ${ENTRYPOINT:-python app.py} # by default launch workers as defined in the Dockerfile
|
||||
tty: true
|
||||
|
15 env/check (vendored)
@ -1,15 +0,0 @@
|
||||
# Does not run any downloads but displays the previously downloaded but not yet checked files. Requires display-access via xauth
|
||||
|
||||
CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving
|
||||
|
||||
XAUTHORITY=$XAUTHORITY
|
||||
XSOCK=/tmp/.X11-unix
|
||||
|
||||
DEBUG=false
|
||||
CHECK=true
|
||||
HEADLESS=true
|
||||
UPLOAD=false
|
||||
REDUCEDFETCH=false
|
||||
|
||||
# ENTRYPOINT="/bin/bash"
|
||||
INTERACTIVE=true
|
20 env/debug (vendored)
@ -1,16 +1,10 @@
|
||||
# Runs in a debugging mode, does not launch anything at all but starts a bash process
|
||||
|
||||
CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving
|
||||
export CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving
|
||||
export UNAME=remy
|
||||
|
||||
CODE=./
|
||||
XAUTHORITY=$XAUTHORITY
|
||||
XSOCK=/tmp/.X11-unix
|
||||
|
||||
DEBUG=true
|
||||
CHECK=false
|
||||
UPLOAD=false
|
||||
HEADLESS=false
|
||||
REDUCEDFETCH=false
|
||||
|
||||
ENTRYPOINT="/bin/bash"
|
||||
INTERACTIVE=true
|
||||
export GECKODRIVER_IMG=selenium/standalone-firefox:104.0
|
||||
export DEBUG=true
|
||||
export HEADFULL=true
|
||||
export CODE=./
|
||||
export ENTRYPOINT=/bin/bash
|
||||
|
9 env/production (vendored)
@ -2,9 +2,6 @@
|
||||
|
||||
CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving
|
||||
|
||||
CONTAINERS_TO_RUN=nas_sync, geckodriver
|
||||
DEBUG=false
|
||||
CHECK=false
|
||||
UPLOAD=false
|
||||
HEADLESS=true
|
||||
REDUCEDFETCH=true
|
||||
export UNAME=remy
|
||||
export GECKODRIVER_IMG=selenium/standalone-firefox:104.0
|
||||
export DEBUG=false
|
||||
|
10 env/upload (vendored)
@ -1,10 +0,0 @@
|
||||
# Does not run any other workers and only uploads to archive the urls that weren't previously uploaded
|
||||
|
||||
CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving
|
||||
|
||||
NEWS_FETCH_DEPENDS_ON="[]"
|
||||
DEBUG=false
|
||||
CHECK=false
|
||||
UPLOAD=true
|
||||
HEADLESS=true
|
||||
REDUCEDFETCH=false
|
46 launch (Normal file)
@ -0,0 +1,46 @@
|
||||
#!/bin/bash
|
||||
set -e
|
||||
set -o ignoreeof
|
||||
|
||||
echo "Bash script launching COSS_ARCHIVING..."
|
||||
|
||||
|
||||
# CHANGE ME ONCE!
|
||||
export CONTAINER_DATA=~/Bulk/COSS/Downloads/coss_archiving
|
||||
export UNAME=remy
|
||||
# CHANGE ME WHEN UPDATING FIREFOX
|
||||
export GECKODRIVER_IMG=selenium/standalone-firefox:104.0
|
||||
# version must be >= the one on the host or firefox will not start (because of mismatched config)
|
||||
|
||||
if [[ $1 == "debug" ]]
|
||||
then
|
||||
export DEBUG=true
|
||||
export HEADFULL=true
|
||||
export CODE=./
|
||||
export ENTRYPOINT=/bin/bash
|
||||
# since --service-ports does not open ports on implicitly started containers, also start geckodriver:
|
||||
docker compose up -d geckodriver
|
||||
elif [[ $1 == "production" ]]
|
||||
then
|
||||
export DEBUG=false
|
||||
elif [[ $1 == "build" ]]
|
||||
then
|
||||
export DEBUG=false
|
||||
docker compose build
|
||||
exit 0
|
||||
elif [[ $1 == "down" ]]
|
||||
then
|
||||
docker compose stop
|
||||
exit 0
|
||||
else
|
||||
echo "Please specify the execution mode (debug/production/build) as the first argument"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
shift # consumes the variable set in $1 so that $@ only contains the remaining arguments
|
||||
|
||||
docker compose run -it --service-ports "$@"
|
||||
|
||||
echo "Docker run finished, shutting down containers..."
|
||||
docker compose stop
|
||||
echo "Bye!"
|
@ -1,5 +1,4 @@
|
||||
import sys
|
||||
from webbrowser import get
|
||||
sys.path.append("../app")
|
||||
import runner
|
||||
import logging
|
||||
@ -15,7 +14,7 @@ runner.configuration.models.set_db(
|
||||
runner.configuration.SqliteDatabase("../.dev/media_message_dummy.db"), # chat_db (not needed here)
|
||||
runner.configuration.SqliteDatabase("../.dev/media_downloads.db")
|
||||
)
|
||||
runner.configuration.parsed["DOWNLOADS"]["local_storage_path"] = "../.dev/"
|
||||
runner.configuration.main_config["DOWNLOADS"]["local_storage_path"] = "../.dev/"
|
||||
|
||||
|
||||
def fetch():
|
||||
|
170 misc/migration.to_postgres.py (Normal file)
@ -0,0 +1,170 @@
|
||||
import datetime
|
||||
import sys
|
||||
sys.path.append("../news_fetch/")
|
||||
import configuration # lives in app
|
||||
from peewee import *
|
||||
|
||||
import os
|
||||
import time
|
||||
|
||||
old_db = SqliteDatabase("/app/containerdata/downloads.db")
|
||||
|
||||
cred = configuration.db_config["DATABASE"]
|
||||
download_db = PostgresqlDatabase(
|
||||
cred["db_name"], user=cred["user_name"], password=cred["password"], host="vpn", port=5432
|
||||
)
|
||||
|
||||
## OLD Models
|
||||
class OLDModel(Model):
|
||||
class Meta:
|
||||
database = old_db
|
||||
|
||||
|
||||
class OLDArticleDownload(OLDModel):
|
||||
class Meta:
|
||||
db_table = 'articledownload'
|
||||
|
||||
title = CharField(default='')
|
||||
pub_date = DateField(default = '')
|
||||
download_date = DateField(default = 0)
|
||||
source_name = CharField(default = '')
|
||||
article_url = TextField(default = '', unique=True)
|
||||
archive_url = TextField(default = '')
|
||||
file_name = TextField(default = '')
|
||||
language = CharField(default = '')
|
||||
summary = TextField(default = '')
|
||||
comment = TextField(default = '')
|
||||
verified = IntegerField(default = False)
|
||||
# authors
|
||||
# keywords
|
||||
# ... are added through foreignkeys
|
||||
|
||||
|
||||
|
||||
|
||||
class OLDArticleAuthor(OLDModel):
|
||||
class Meta:
|
||||
db_table = 'articleauthor'
|
||||
|
||||
article = ForeignKeyField(OLDArticleDownload, backref='authors')
|
||||
author = CharField()
|
||||
|
||||
|
||||
|
||||
class OLDArticleRelated(OLDModel):
|
||||
class Meta:
|
||||
db_table = 'articlerelated'
|
||||
|
||||
article = ForeignKeyField(OLDArticleDownload, backref='related')
|
||||
related_file_name = TextField(default = '')
|
||||
|
||||
|
||||
|
||||
|
||||
## NEW Models
|
||||
class NEWModel(Model):
|
||||
class Meta:
|
||||
database = download_db
|
||||
|
||||
|
||||
class ArticleDownload(NEWModel):
|
||||
# in the beginning this is all we have
|
||||
article_url = TextField(default = '', unique=True)
|
||||
# fetch then fills in the metadata
|
||||
title = TextField(default='')
|
||||
summary = TextField(default = '')
|
||||
source_name = CharField(default = '')
|
||||
language = CharField(default = '')
|
||||
file_name = TextField(default = '')
|
||||
archive_url = TextField(default = '')
|
||||
pub_date = DateField(default = '')
|
||||
download_date = DateField(default = 0)
|
||||
slack_ts = FloatField(default = 0) # should be a fixed-length string but float is easier to sort by
|
||||
sent = BooleanField(default = False)
|
||||
archived_by = CharField(default = os.getenv("UNAME"))
|
||||
# need to know who saved the message because the file needs to be on their computer in order to get verified
|
||||
# verification happens in a different app, but the model has the fields here as well
|
||||
comment = TextField(default = '')
|
||||
verified = IntegerField(default = 0) # 0 = not verified, 1 = verified, -1 = marked as bad
|
||||
|
||||
def set_authors(self, authors):
|
||||
for a in authors:
|
||||
if len(a) < 100:
|
||||
ArticleAuthor.create(
|
||||
article = self,
|
||||
author = a
|
||||
)
|
||||
|
||||
def set_related(self, related):
|
||||
for r in related:
|
||||
ArticleRelated.create(
|
||||
article = self,
|
||||
related_file_name = r
|
||||
)
|
||||
|
||||
# authors
|
||||
# keywords
|
||||
# ... are added through foreignkeys
|
||||
# we will also add an attribute named message, to reference which message should be replied to. This attribute does not need to be saved in the db
|
||||
|
||||
|
||||
|
||||
class ArticleAuthor(NEWModel):
|
||||
article = ForeignKeyField(ArticleDownload, backref='authors')
|
||||
author = CharField()
|
||||
|
||||
|
||||
class ArticleRelated(NEWModel):
|
||||
# Related files, such as the full text of a paper, audio files, etc.
|
||||
article = ForeignKeyField(ArticleDownload, backref='related')
|
||||
related_file_name = TextField(default = '')
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
####################################################################
|
||||
# Migrate using sensible defaults:
|
||||
download_db.create_tables([ArticleDownload, ArticleAuthor, ArticleRelated])
|
||||
|
||||
it = 0
|
||||
for old_art in OLDArticleDownload.select():
|
||||
print("====================================================================")
|
||||
it+=1
|
||||
print(f"IT {it} New article with data:")
|
||||
print(
|
||||
old_art.article_url,
|
||||
old_art.title,
|
||||
old_art.summary,
|
||||
old_art.source_name,
|
||||
old_art.language,
|
||||
old_art.file_name,
|
||||
old_art.archive_url,
|
||||
old_art.pub_date if old_art.pub_date != "" else datetime.date.fromtimestamp(0),
|
||||
old_art.download_date,
|
||||
True,
|
||||
old_art.comment,
|
||||
old_art.verified
|
||||
)
|
||||
new_art = ArticleDownload.create(
|
||||
article_url = old_art.article_url,
|
||||
title = old_art.title,
|
||||
summary = old_art.summary,
|
||||
source_name = old_art.source_name,
|
||||
language = old_art.language,
|
||||
file_name = old_art.file_name,
|
||||
archive_url = old_art.archive_url,
|
||||
pub_date = old_art.pub_date if old_art.pub_date != "" else datetime.date.fromtimestamp(0),
|
||||
download_date = old_art.download_date,
|
||||
# slack_ts = FloatField(default = 0)
|
||||
sent = True,
|
||||
# archived_by = CharField(default = os.getenv("UNAME"))
|
||||
comment = old_art.comment,
|
||||
verified = old_art.verified
|
||||
)
|
||||
|
||||
|
||||
new_art.set_related([r.related_file_name for r in old_art.related])
|
||||
new_art.set_authors([a.author for a in old_art.authors])
|
||||
|
12 misc/sample_config/nas_sync.config (Normal file)
@ -0,0 +1,12 @@
|
||||
settings {
|
||||
logfile = "/tmp/lsyncd.log",
|
||||
statusFile = "/tmp/lsyncd.status",
|
||||
nodaemon = true,
|
||||
}
|
||||
|
||||
sync {
|
||||
default.rsync,
|
||||
source = "/sync/local_files",
|
||||
target = "/sync/remote_files",
|
||||
init = false,
|
||||
}
|
25 news_check/Dockerfile (Normal file)
@ -0,0 +1,25 @@
|
||||
FROM node:18.8 as build-deps
|
||||
|
||||
WORKDIR /app/client
|
||||
COPY client/package.json ./
|
||||
COPY client/package-lock.json ./
|
||||
COPY client/rollup.config.js ./
|
||||
COPY client/src ./src/
|
||||
RUN npm install
|
||||
RUN npm run build
|
||||
|
||||
|
||||
FROM python:latest
|
||||
ENV TZ Europe/Zurich
|
||||
|
||||
WORKDIR /app/news_check
|
||||
|
||||
COPY requirements.txt requirements.txt
|
||||
RUN python3 -m pip install -r requirements.txt
|
||||
|
||||
COPY client/public/index.html client/public/index.html
|
||||
COPY --from=build-deps /app/client/public client/public/
|
||||
COPY server server/
|
||||
|
||||
WORKDIR /app/news_check/server
|
||||
# CMD python app.py
|
4 news_check/client/.gitignore (vendored, Normal file)
@ -0,0 +1,4 @@
|
||||
/node_modules/
|
||||
/public/build/
|
||||
|
||||
.DS_Store
|
107 news_check/client/README.md (Normal file)
@ -0,0 +1,107 @@
|
||||
# This repo is no longer maintained. Consider using `npm init vite` and selecting the `svelte` option or — if you want a full-fledged app framework and don't mind using pre-1.0 software — use [SvelteKit](https://kit.svelte.dev), the official application framework for Svelte.
|
||||
|
||||
---
|
||||
|
||||
# svelte app
|
||||
|
||||
This is a project template for [Svelte](https://svelte.dev) apps. It lives at https://github.com/sveltejs/template.
|
||||
|
||||
To create a new project based on this template using [degit](https://github.com/Rich-Harris/degit):
|
||||
|
||||
```bash
|
||||
npx degit sveltejs/template svelte-app
|
||||
cd svelte-app
|
||||
```
|
||||
|
||||
*Note that you will need to have [Node.js](https://nodejs.org) installed.*
|
||||
|
||||
|
||||
## Get started
|
||||
|
||||
Install the dependencies...
|
||||
|
||||
```bash
|
||||
cd svelte-app
|
||||
npm install
|
||||
```
|
||||
|
||||
...then start [Rollup](https://rollupjs.org):
|
||||
|
||||
```bash
|
||||
npm run dev
|
||||
```
|
||||
|
||||
Navigate to [localhost:8080](http://localhost:8080). You should see your app running. Edit a component file in `src`, save it, and reload the page to see your changes.
|
||||
|
||||
By default, the server will only respond to requests from localhost. To allow connections from other computers, edit the `sirv` commands in package.json to include the option `--host 0.0.0.0`.
|
||||
|
||||
If you're using [Visual Studio Code](https://code.visualstudio.com/) we recommend installing the official extension [Svelte for VS Code](https://marketplace.visualstudio.com/items?itemName=svelte.svelte-vscode). If you are using other editors you may need to install a plugin in order to get syntax highlighting and intellisense.
|
||||
|
||||
## Building and running in production mode
|
||||
|
||||
To create an optimised version of the app:
|
||||
|
||||
```bash
|
||||
npm run build
|
||||
```
|
||||
|
||||
You can run the newly built app with `npm run start`. This uses [sirv](https://github.com/lukeed/sirv), which is included in your package.json's `dependencies` so that the app will work when you deploy to platforms like [Heroku](https://heroku.com).
|
||||
|
||||
|
||||
## Single-page app mode
|
||||
|
||||
By default, sirv will only respond to requests that match files in `public`. This is to maximise compatibility with static fileservers, allowing you to deploy your app anywhere.
|
||||
|
||||
If you're building a single-page app (SPA) with multiple routes, sirv needs to be able to respond to requests for *any* path. You can make it so by editing the `"start"` command in package.json:
|
||||
|
||||
```js
|
||||
"start": "sirv public --single"
|
||||
```
|
||||
|
||||
## Using TypeScript
|
||||
|
||||
This template comes with a script to set up a TypeScript development environment, you can run it immediately after cloning the template with:
|
||||
|
||||
```bash
|
||||
node scripts/setupTypeScript.js
|
||||
```
|
||||
|
||||
Or remove the script via:
|
||||
|
||||
```bash
|
||||
rm scripts/setupTypeScript.js
|
||||
```
|
||||
|
||||
If you want to use `baseUrl` or `path` aliases within your `tsconfig`, you need to set up `@rollup/plugin-alias` to tell Rollup to resolve the aliases. For more info, see [this StackOverflow question](https://stackoverflow.com/questions/63427935/setup-tsconfig-path-in-svelte).
|
||||
|
||||
## Deploying to the web
|
||||
|
||||
### With [Vercel](https://vercel.com)
|
||||
|
||||
Install `vercel` if you haven't already:
|
||||
|
||||
```bash
|
||||
npm install -g vercel
|
||||
```
|
||||
|
||||
Then, from within your project folder:
|
||||
|
||||
```bash
|
||||
cd public
|
||||
vercel deploy --name my-project
|
||||
```
|
||||
|
||||
### With [surge](https://surge.sh/)
|
||||
|
||||
Install `surge` if you haven't already:
|
||||
|
||||
```bash
|
||||
npm install -g surge
|
||||
```
|
||||
|
||||
Then, from within your project folder:
|
||||
|
||||
```bash
|
||||
npm run build
|
||||
surge public my-project.surge.sh
|
||||
```
|
1955 news_check/client/package-lock.json (generated, Normal file)
File diff suppressed because it is too large
23 news_check/client/package.json (Normal file)
@ -0,0 +1,23 @@
|
||||
{
|
||||
"name": "svelte-app",
|
||||
"version": "1.0.0",
|
||||
"private": true,
|
||||
"scripts": {
|
||||
"build": "rollup -c",
|
||||
"dev": "rollup -c -w",
|
||||
"start": "sirv public --no-clear"
|
||||
},
|
||||
"devDependencies": {
|
||||
"@rollup/plugin-commonjs": "^17.0.0",
|
||||
"@rollup/plugin-node-resolve": "^11.0.0",
|
||||
"rollup": "^2.3.4",
|
||||
"rollup-plugin-css-only": "^3.1.0",
|
||||
"rollup-plugin-livereload": "^2.0.0",
|
||||
"rollup-plugin-svelte": "^7.0.0",
|
||||
"rollup-plugin-terser": "^7.0.0",
|
||||
"svelte": "^3.0.0"
|
||||
},
|
||||
"dependencies": {
|
||||
"sirv-cli": "^2.0.0"
|
||||
}
|
||||
}
|
BIN news_check/client/public/favicon.png (Normal file)
Binary file not shown.
After: Size: 3.1 KiB
63 news_check/client/public/global.css (Normal file)
@ -0,0 +1,63 @@
|
||||
html, body {
|
||||
position: relative;
|
||||
width: 100%;
|
||||
height: 100%;
|
||||
}
|
||||
|
||||
body {
|
||||
color: #333;
|
||||
margin: 0;
|
||||
padding: 8px;
|
||||
box-sizing: border-box;
|
||||
font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen-Sans, Ubuntu, Cantarell, "Helvetica Neue", sans-serif;
|
||||
}
|
||||
|
||||
a {
|
||||
color: rgb(0,100,200);
|
||||
text-decoration: none;
|
||||
}
|
||||
|
||||
a:hover {
|
||||
text-decoration: underline;
|
||||
}
|
||||
|
||||
a:visited {
|
||||
color: rgb(0,80,160);
|
||||
}
|
||||
|
||||
label {
|
||||
display: block;
|
||||
}
|
||||
|
||||
input, button, select, textarea {
|
||||
font-family: inherit;
|
||||
font-size: inherit;
|
||||
-webkit-padding: 0.4em 0;
|
||||
padding: 0.4em;
|
||||
margin: 0 0 0.5em 0;
|
||||
box-sizing: border-box;
|
||||
border: 1px solid #ccc;
|
||||
border-radius: 2px;
|
||||
}
|
||||
|
||||
input:disabled {
|
||||
color: #ccc;
|
||||
}
|
||||
|
||||
button {
|
||||
color: #333;
|
||||
background-color: #f4f4f4;
|
||||
outline: none;
|
||||
}
|
||||
|
||||
button:disabled {
|
||||
color: #999;
|
||||
}
|
||||
|
||||
button:not(:disabled):active {
|
||||
background-color: #ddd;
|
||||
}
|
||||
|
||||
button:focus {
|
||||
border-color: #666;
|
||||
}
|
25 news_check/client/public/index.html (Normal file)
@ -0,0 +1,25 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset='utf-8'>
|
||||
<meta name='viewport' content='width=device-width,initial-scale=1'>
|
||||
|
||||
<title>NEWS CHECK</title>
|
||||
|
||||
<link rel='icon' type='image/png' href='https://ethz.ch/etc/designs/ethz/img/icons/ETH-APP-Icons-Theme-white/192-xxxhpdi.png'>
|
||||
<link rel='stylesheet' href='/build/bundle.css'>
|
||||
|
||||
<script defer src='/build/bundle.js'></script>
|
||||
|
||||
<link href="https://cdn.jsdelivr.net/npm/daisyui@2.24.0/dist/full.css" rel="stylesheet" type="text/css" />
|
||||
<script src="https://cdn.tailwindcss.com"></script>
|
||||
|
||||
<script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.0.943/pdf.min.js"></script>
|
||||
<html data-theme="light"></html> <!-- Daisy-ui theme -->
|
||||
|
||||
</head>
|
||||
|
||||
|
||||
<body>
|
||||
</body>
|
||||
</html>
|
76 news_check/client/rollup.config.js (Normal file)
@ -0,0 +1,76 @@
|
||||
import svelte from 'rollup-plugin-svelte';
|
||||
import commonjs from '@rollup/plugin-commonjs';
|
||||
import resolve from '@rollup/plugin-node-resolve';
|
||||
import livereload from 'rollup-plugin-livereload';
|
||||
import { terser } from 'rollup-plugin-terser';
|
||||
import css from 'rollup-plugin-css-only';
|
||||
|
||||
const production = !process.env.ROLLUP_WATCH;
|
||||
|
||||
function serve() {
|
||||
let server;
|
||||
|
||||
function toExit() {
|
||||
if (server) server.kill(0);
|
||||
}
|
||||
|
||||
return {
|
||||
writeBundle() {
|
||||
if (server) return;
|
||||
server = require('child_process').spawn('npm', ['run', 'start', '--', '--dev'], {
|
||||
stdio: ['ignore', 'inherit', 'inherit'],
|
||||
shell: true
|
||||
});
|
||||
|
||||
process.on('SIGTERM', toExit);
|
||||
process.on('exit', toExit);
|
||||
}
|
||||
};
|
||||
}
|
||||
|
||||
export default {
|
||||
input: 'src/main.js',
|
||||
output: {
|
||||
sourcemap: true,
|
||||
format: 'iife',
|
||||
name: 'app',
|
||||
file: 'public/build/bundle.js'
|
||||
},
|
||||
plugins: [
|
||||
svelte({
|
||||
compilerOptions: {
|
||||
// enable run-time checks when not in production
|
||||
dev: !production
|
||||
}
|
||||
}),
|
||||
// we'll extract any component CSS out into
|
||||
// a separate file - better for performance
|
||||
css({ output: 'bundle.css' }),
|
||||
|
||||
// If you have external dependencies installed from
|
||||
// npm, you'll most likely need these plugins. In
|
||||
// some cases you'll need additional configuration -
|
||||
// consult the documentation for details:
|
||||
// https://github.com/rollup/plugins/tree/master/packages/commonjs
|
||||
resolve({
|
||||
browser: true,
|
||||
dedupe: ['svelte']
|
||||
}),
|
||||
commonjs(),
|
||||
|
||||
// In dev mode, call `npm run start` once
|
||||
// the bundle has been generated
|
||||
!production && serve(),
|
||||
|
||||
// Watch the `public` directory and refresh the
|
||||
// browser on changes when not in production
|
||||
!production && livereload('public'),
|
||||
|
||||
// If we're building for production (npm run build
|
||||
// instead of npm run dev), minify
|
||||
production && terser()
|
||||
],
|
||||
watch: {
|
||||
clearScreen: false
|
||||
}
|
||||
};
|
121 news_check/client/scripts/setupTypeScript.js (Normal file)
@ -0,0 +1,121 @@
|
||||
// @ts-check
|
||||
|
||||
/** This script modifies the project to support TS code in .svelte files like:
|
||||
|
||||
<script lang="ts">
|
||||
export let name: string;
|
||||
</script>
|
||||
|
||||
As well as validating the code for CI.
|
||||
*/
|
||||
|
||||
/** To work on this script:
|
||||
rm -rf test-template template && git clone sveltejs/template test-template && node scripts/setupTypeScript.js test-template
|
||||
*/
|
||||
|
||||
const fs = require("fs")
|
||||
const path = require("path")
|
||||
const { argv } = require("process")
|
||||
|
||||
const projectRoot = argv[2] || path.join(__dirname, "..")
|
||||
|
||||
// Add deps to pkg.json
|
||||
const packageJSON = JSON.parse(fs.readFileSync(path.join(projectRoot, "package.json"), "utf8"))
|
||||
packageJSON.devDependencies = Object.assign(packageJSON.devDependencies, {
|
||||
"svelte-check": "^2.0.0",
|
||||
"svelte-preprocess": "^4.0.0",
|
||||
"@rollup/plugin-typescript": "^8.0.0",
|
||||
"typescript": "^4.0.0",
|
||||
"tslib": "^2.0.0",
|
||||
"@tsconfig/svelte": "^2.0.0"
|
||||
})
|
||||
|
||||
// Add script for checking
|
||||
packageJSON.scripts = Object.assign(packageJSON.scripts, {
|
||||
"check": "svelte-check --tsconfig ./tsconfig.json"
|
||||
})
|
||||
|
||||
// Write the package JSON
|
||||
fs.writeFileSync(path.join(projectRoot, "package.json"), JSON.stringify(packageJSON, null, " "))
|
||||
|
||||
// mv src/main.js to main.ts - note, we need to edit rollup.config.js for this too
|
||||
const beforeMainJSPath = path.join(projectRoot, "src", "main.js")
|
||||
const afterMainTSPath = path.join(projectRoot, "src", "main.ts")
|
||||
fs.renameSync(beforeMainJSPath, afterMainTSPath)
|
||||
|
||||
// Switch the app.svelte file to use TS
|
||||
const appSveltePath = path.join(projectRoot, "src", "App.svelte")
|
||||
let appFile = fs.readFileSync(appSveltePath, "utf8")
|
||||
appFile = appFile.replace("<script>", '<script lang="ts">')
|
||||
appFile = appFile.replace("export let name;", 'export let name: string;')
|
||||
fs.writeFileSync(appSveltePath, appFile)
|
||||
|
||||
// Edit rollup config
|
||||
const rollupConfigPath = path.join(projectRoot, "rollup.config.js")
|
||||
let rollupConfig = fs.readFileSync(rollupConfigPath, "utf8")
|
||||
|
||||
// Edit imports
|
||||
rollupConfig = rollupConfig.replace(`'rollup-plugin-terser';`, `'rollup-plugin-terser';
|
||||
import sveltePreprocess from 'svelte-preprocess';
|
||||
import typescript from '@rollup/plugin-typescript';`)
|
||||
|
||||
// Replace name of entry point
|
||||
rollupConfig = rollupConfig.replace(`'src/main.js'`, `'src/main.ts'`)
|
||||
|
||||
// Add preprocessor
|
||||
rollupConfig = rollupConfig.replace(
|
||||
'compilerOptions:',
|
||||
'preprocess: sveltePreprocess({ sourceMap: !production }),\n\t\t\tcompilerOptions:'
|
||||
);
|
||||
|
||||
// Add TypeScript
|
||||
rollupConfig = rollupConfig.replace(
|
||||
'commonjs(),',
|
||||
'commonjs(),\n\t\ttypescript({\n\t\t\tsourceMap: !production,\n\t\t\tinlineSources: !production\n\t\t}),'
|
||||
);
|
||||
fs.writeFileSync(rollupConfigPath, rollupConfig)
|
||||
|
||||
// Add TSConfig
|
||||
const tsconfig = `{
|
||||
"extends": "@tsconfig/svelte/tsconfig.json",
|
||||
|
||||
"include": ["src/**/*"],
|
||||
"exclude": ["node_modules/*", "__sapper__/*", "public/*"]
|
||||
}`
|
||||
const tsconfigPath = path.join(projectRoot, "tsconfig.json")
|
||||
fs.writeFileSync(tsconfigPath, tsconfig)
|
||||
|
||||
// Add global.d.ts
|
||||
const dtsPath = path.join(projectRoot, "src", "global.d.ts")
|
||||
fs.writeFileSync(dtsPath, `/// <reference types="svelte" />`)
|
||||
|
||||
// Delete this script, but not during testing
|
||||
if (!argv[2]) {
|
||||
// Remove the script
|
||||
fs.unlinkSync(path.join(__filename))
|
||||
|
||||
// Check for Mac's DS_store file, and if it's the only one left remove it
|
||||
const remainingFiles = fs.readdirSync(path.join(__dirname))
|
||||
if (remainingFiles.length === 1 && remainingFiles[0] === '.DS_store') {
|
||||
fs.unlinkSync(path.join(__dirname, '.DS_store'))
|
||||
}
|
||||
|
||||
// Check if the scripts folder is empty
|
||||
if (fs.readdirSync(path.join(__dirname)).length === 0) {
|
||||
// Remove the scripts folder
|
||||
fs.rmdirSync(path.join(__dirname))
|
||||
}
|
||||
}
|
||||
|
||||
// Adds the extension recommendation
|
||||
fs.mkdirSync(path.join(projectRoot, ".vscode"), { recursive: true })
|
||||
fs.writeFileSync(path.join(projectRoot, ".vscode", "extensions.json"), `{
|
||||
"recommendations": ["svelte.svelte-vscode"]
|
||||
}
|
||||
`)
|
||||
|
||||
console.log("Converted to TypeScript.")
|
||||
|
||||
if (fs.existsSync(path.join(projectRoot, "node_modules"))) {
|
||||
console.log("\nYou will need to re-run your dependency manager to get started.")
|
||||
}
|
39 news_check/client/src/App.svelte (Normal file)
@ -0,0 +1,39 @@
|
||||
<script>
|
||||
import PDFView from './PDFView.svelte';
|
||||
import ArticleStatus from './ArticleStatus.svelte';
|
||||
import ArticleOperations from './ArticleOperations.svelte';
|
||||
|
||||
let current_id = 0;
|
||||
|
||||
const updateInterface = (async () => {
|
||||
let url = '';
|
||||
if (current_id == 0) {
|
||||
url = '/api/article/first';
|
||||
} else {
|
||||
url = '/api/article/' + current_id + '/next';
|
||||
}
|
||||
const response = await fetch(url)
|
||||
const data = await response.json()
|
||||
current_id = data.id;
|
||||
let article_url = '/api/article/' + current_id + '/get';
|
||||
const article_response = await fetch(article_url);
|
||||
const article_data = await article_response.json();
|
||||
return article_data;
|
||||
})()
|
||||
|
||||
|
||||
</script>
|
||||
|
||||
{#await updateInterface}
|
||||
...
|
||||
{:then article_data}
|
||||
<div class="flex w-full h-screen gap-5 p-5">
|
||||
<div class="w-3/5"><PDFView article_data={article_data}/></div>
|
||||
<div class="divider divider-horizontal"></div>
|
||||
<div class="w-2/5">
|
||||
<ArticleStatus article_data={article_data}/>
|
||||
<div class="divider divider-vertical"></div>
|
||||
<ArticleOperations article_data={article_data}/>
|
||||
</div>
|
||||
</div>
|
||||
{/await}
|
93 news_check/client/src/ArticleOperations.svelte (Normal file)
@ -0,0 +1,93 @@
|
||||
<script>
|
||||
import {fade} from 'svelte/transition';
|
||||
|
||||
export let article_data;
|
||||
|
||||
const actions = [
|
||||
{name: 'Mark as good (and skip to next)', kbd: 'A'},
|
||||
{name: 'Mark as bad (and skip to next)', kbd: 'B'},
|
||||
{name: 'Upload related file', kbd: 'R'},
|
||||
{name: 'Skip', kbd: 'ctrl'},
|
||||
]
|
||||
|
||||
const toast_states = {
|
||||
'success' : {class: 'alert-success', text: 'Article updated successfully'},
|
||||
'error' : {class: 'alert-error', text: 'Article update failed'},
|
||||
}
|
||||
let toast_state = {};
|
||||
let toast_visible = false;
|
||||
|
||||
|
||||
function onKeyDown(e) {apiAction(e.key)}
|
||||
function apiAction(key) {
|
||||
if (actions.map(d => d.kbd.toLowerCase()).includes(key.toLowerCase())){ // ignore other keypresses
|
||||
|
||||
const updateArticle = (async() => {
|
||||
const response = await fetch('/api/article/' + article_data.id + '/set', {
|
||||
method: 'POST',
|
||||
headers: {'Content-Type': 'application/json'},
|
||||
body: JSON.stringify({
|
||||
'action': key.toLowerCase(),
|
||||
})
|
||||
})
|
||||
const success = response.status == 200;
|
||||
|
||||
if (success){
|
||||
showToast('success');
|
||||
} else {
|
||||
showToast('error');
|
||||
}
|
||||
|
||||
})()
|
||||
}
|
||||
}
|
||||
|
||||
function showToast(state){
|
||||
toast_visible = true;
|
||||
toast_state = toast_states[state];
|
||||
setTimeout(() => {
|
||||
toast_visible = false;
|
||||
}, 1000)
|
||||
|
||||
}
|
||||
</script>
|
||||
|
||||
|
||||
<div class="card bg-neutral-300 shadow-xl">
|
||||
<div class="card-body">
|
||||
<h2 class="card-title">Your options: (click on action or use keyboard)</h2>
|
||||
<div class="overflow-x-auto">
|
||||
<table class="table w-full table-compact">
|
||||
<!-- head -->
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Action</th>
|
||||
<th>Keyboard shortcut</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
{#each actions as action}
|
||||
|
||||
<tr>
|
||||
<td><button on:click={() => apiAction(action.kbd)}>{ action.name }</button></td>
|
||||
<td><kbd class="kbd">{ action.kbd }</kbd></td>
|
||||
</tr>
|
||||
|
||||
{/each}
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<svelte:window on:keydown|preventDefault={onKeyDown} />
|
||||
|
||||
{#if toast_visible}
|
||||
<div class="toast" transition:fade>
|
||||
<div class="alert { toast_state.class }">
|
||||
<div>
|
||||
<span>{ toast_state.text }.</span>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
{/if}
|
38 news_check/client/src/ArticleStatus.svelte (Normal file)
@ -0,0 +1,38 @@
|
||||
<script>
|
||||
export let article_data;
|
||||
const status_items = [
|
||||
{name: 'Title', value: article_data.title},
|
||||
{name: 'Filename', value: article_data.file_name},
|
||||
{name: 'Language', value: article_data.language},
|
||||
{name: 'Authors', value: article_data.authors},
|
||||
{name: "Related", value: article_data.related},
|
||||
]
|
||||
</script>
|
||||
|
||||
<div class="card bg-neutral-300 shadow-xl overflow-x-auto">
|
||||
<div class="card-body">
|
||||
<h2 class="card-title">Article overview:</h2>
|
||||
<table class="table w-full table-compact" style="table-layout: fixed">
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Attribute</th>
|
||||
<th>Value</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
{#each status_items as item}
|
||||
<tr>
|
||||
<td>{ item.name }</td>
|
||||
<!-- <td>Quality Control Specialist</td> -->
|
||||
{#if item.value != ""}
|
||||
<td class='bg-emerald-200' style="white-space: normal; width:70%">{ item.value }</td>
|
||||
{:else}
|
||||
<td class='bg-red-200'>{ item.value }</td>
|
||||
{/if}
|
||||
</tr>
|
||||
{/each}
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
|
||||
|
||||
</div>
|
15 news_check/client/src/PDFView.svelte (Normal file)
@ -0,0 +1,15 @@
|
||||
|
||||
<script>
|
||||
export let article_data;
|
||||
</script>
|
||||
|
||||
<div class="h-full w-full shadow-xl">
|
||||
<object class="pdf-view" data="{article_data.save_path + article_data.file_name}" title="Article PDF"> </object>
|
||||
</div>
|
||||
|
||||
<style>
|
||||
.pdf-view {
|
||||
width: 100%;
|
||||
height: 100%;
|
||||
}
|
||||
</style>
|
10 news_check/client/src/main.js (Normal file)
@ -0,0 +1,10 @@
|
||||
import App from './App.svelte';
|
||||
|
||||
const app = new App({
|
||||
target: document.body,
|
||||
props: {
|
||||
name: 'world'
|
||||
}
|
||||
});
|
||||
|
||||
export default app;
|
4 news_check/requirements.txt (Normal file)
@ -0,0 +1,4 @@
|
||||
flask
|
||||
peewee
|
||||
markdown
|
||||
psycopg2
|
67 news_check/server/app.py (Normal file)
@ -0,0 +1,67 @@
|
||||
from flask import Flask, send_from_directory, request
|
||||
import configuration
|
||||
models = configuration.models
|
||||
db = configuration.db
|
||||
app = Flask(__name__)
|
||||
|
||||
|
||||
###############################################################################
|
||||
# SVELTE 'STATIC' BACKEND. Always send index.html and the requested js-files. (compiled by npm)
|
||||
|
||||
@app.route("/") #index.html
|
||||
def index():
|
||||
return send_from_directory('../client/public', 'index.html')
|
||||
@app.route("/<path:path>") #js-files
|
||||
def js(path):
|
||||
return send_from_directory('../client/public', path)
|
||||
@app.route("/app/containerdata/files/<path:path>")
|
||||
def static_pdf(path):
|
||||
return send_from_directory('/app/containerdata/files/', path)
|
||||
|
||||
|
||||
|
||||
###############################################################################
|
||||
# (simple) API for news_check.
|
||||
|
||||
@app.route("/api/article/<int:id>/get")
|
||||
def get_article_by_id(id):
|
||||
with db:
|
||||
article = models.ArticleDownload.get_by_id(id)
|
||||
return article.to_dict()
|
||||
|
||||
@app.route("/api/article/first")
|
||||
def get_article_first():
|
||||
with db:
|
||||
article = models.ArticleDownload.select(models.ArticleDownload.id).where(models.ArticleDownload.verified == 0).order_by(models.ArticleDownload.id).first()
|
||||
return {"id" : article.id}
|
||||
|
||||
@app.route("/api/article/<int:id>/next")
|
||||
def get_article_next(id):
|
||||
with db:
|
||||
if models.ArticleDownload.get_by_id(id + 1).verified == 0:
|
||||
return {"id" : id + 1}
|
||||
else:
|
||||
return get_article_first()
|
||||
|
||||
|
||||
|
||||
@app.route("/api/article/<int:id>/set", methods=['POST'])
|
||||
def set_article(id):
|
||||
action = request.json['action']
|
||||
with db:
|
||||
article = models.ArticleDownload.get_by_id(id)
|
||||
if action == "a":
|
||||
article.verified = 1
|
||||
elif action == "b":
|
||||
article.verified = -1
|
||||
elif action == "r":
|
||||
article.set_related()
|
||||
article.save()
|
||||
return "ok"
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
app.run(host="0.0.0.0", port="80")
|
16 news_check/server/configuration.py (Normal file)
@ -0,0 +1,16 @@
|
||||
from peewee import PostgresqlDatabase
|
||||
import configparser
|
||||
|
||||
main_config = configparser.ConfigParser()
|
||||
main_config.read("/app/containerdata/config/news_fetch.config.ini")
|
||||
|
||||
db_config = configparser.ConfigParser()
|
||||
db_config.read("/app/containerdata/config/db.config.ini")
|
||||
|
||||
cred = db_config["DATABASE"]
|
||||
db = PostgresqlDatabase(
|
||||
cred["db_name"], user=cred["user_name"], password=cred["password"], host="vpn", port=5432
|
||||
)
|
||||
|
||||
import models
|
||||
models.set_db(db)
|
134 news_check/server/models.py (Normal file)
@ -0,0 +1,134 @@
|
||||
import logging
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
from peewee import *
|
||||
import os
|
||||
import datetime
|
||||
import configuration
|
||||
|
||||
config = configuration.main_config["DOWNLOADS"]
|
||||
|
||||
# set the nature of the db at runtime
|
||||
download_db = DatabaseProxy()
|
||||
|
||||
|
||||
class DownloadBaseModel(Model):
|
||||
class Meta:
|
||||
database = download_db
|
||||
|
||||
|
||||
|
||||
## == Article related models == ##
|
||||
class ArticleDownload(DownloadBaseModel):
|
||||
# in the beginning this is all we have
|
||||
article_url = TextField(default = '', unique=True)
|
||||
|
||||
# fetch then fills in the metadata
|
||||
title = TextField(default='')
|
||||
|
||||
summary = TextField(default = '')
|
||||
source_name = CharField(default = '')
|
||||
language = CharField(default = '')
|
||||
|
||||
|
||||
file_name = TextField(default = '')
|
||||
@property
|
||||
def save_path(self):
|
||||
return f"{config['local_storage_path']}/{self.download_date.year}/{self.download_date.strftime('%B')}/"
|
||||
@property
|
||||
def fname_nas(self, file_name=""):
|
||||
if self.download_date:
|
||||
if file_name:
|
||||
return f"NAS: {config['remote_storage_path']}/{self.download_date.year}/{self.download_date.strftime('%B')}/{file_name}"
|
||||
else: # return self.file_name
|
||||
return f"NAS: {config['remote_storage_path']}/{self.download_date.year}/{self.download_date.strftime('%B')}/{self.file_name}"
|
||||
else:
|
||||
return None
|
||||
|
||||
|
||||
archive_url = TextField(default = '')
|
||||
pub_date = DateField(default = datetime.date.fromtimestamp(0))
|
||||
download_date = DateField(default = datetime.date.today)
|
||||
|
||||
slack_ts = FloatField(default = 0) # should be a fixed-length string but float is easier to sort by
|
||||
|
||||
sent = BooleanField(default = False)
|
||||
|
||||
archived_by = CharField(default = os.getenv("UNAME"))
|
||||
# need to know who saved the message because the file needs to be on their computer in order to get verified
|
||||
# verification happens in a different app, but the model has the fields here as well
|
||||
comment = TextField(default = '')
|
||||
verified = IntegerField(default = 0) # 0 = not verified, 1 = verified, -1 = marked as bad
|
||||
|
||||
# authors
|
||||
# keywords
|
||||
# ... are added through foreignkeys
|
||||
# we will also add an attribute named message, to reference which message should be replied to. This attribute does not need to be saved in the db
|
||||
|
||||
def to_dict(self):
|
||||
return {
|
||||
"id": self.id,
|
||||
"article_url": self.article_url,
|
||||
"title": self.title,
|
||||
"summary": self.summary,
|
||||
"source_name": self.source_name,
|
||||
"language": self.language,
|
||||
"file_name": self.file_name,
|
||||
"save_path": self.save_path,
|
||||
"fname_nas": self.fname_nas,
|
||||
"archive_url": self.archive_url,
|
||||
"pub_date": self.pub_date.strftime("%Y-%m-%d"),
|
||||
"download_date": self.download_date.strftime("%Y-%m-%d"),
|
||||
"sent": self.sent,
|
||||
"comment": self.comment,
|
||||
"related": [r.related_file_name for r in self.related],
|
||||
"authors": [a.author for a in self.authors]
|
||||
}
|
||||
|
||||
|
||||
|
||||
def set_related(self, related):
|
||||
for r in related:
|
||||
if len(r) > 255:
|
||||
raise Exception("Related file name too long for POSTGRES")
|
||||
|
||||
ArticleRelated.create(
|
||||
article = self,
|
||||
related_file_name = r
|
||||
)
|
||||
|
||||
def file_status(self):
|
||||
if not self.file_name:
|
||||
logger.error(f"Article {self} has no filename!")
|
||||
return False, {"reply_text": "Download failed, no file was saved.", "file_path": None}
|
||||
|
||||
file_path_abs = self.save_path + self.file_name
|
||||
if not os.path.exists(file_path_abs):
|
||||
logger.error(f"Article {self} has a filename, but the file does not exist at that location!")
|
||||
return False, {"reply_text": "Can't find file. Either the download failed or the file was moved.", "file_path": None}
|
||||
|
||||
return True, {}
|
||||
|
||||
|
||||
class ArticleAuthor(DownloadBaseModel):
|
||||
article = ForeignKeyField(ArticleDownload, backref='authors')
|
||||
author = CharField()
|
||||
|
||||
|
||||
class ArticleRelated(DownloadBaseModel):
|
||||
# Related files, such as the full text of a paper, audio files, etc.
|
||||
article = ForeignKeyField(ArticleDownload, backref='related')
|
||||
related_file_name = TextField(default = '')
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
def set_db(download_db_object):
|
||||
download_db.initialize(download_db_object)
|
||||
with download_db: # create tables (does nothing if they exist already)
|
||||
download_db.create_tables([ArticleDownload, ArticleAuthor, ArticleRelated])
|
||||
|
||||
|
@ -1,2 +1,2 @@
|
||||
.dev/
|
||||
Dockerfile
|
||||
__pycache__/
|
@ -2,26 +2,21 @@ FROM python:latest
|
||||
|
||||
ENV TZ Europe/Zurich
|
||||
|
||||
|
||||
RUN apt-get update && apt-get install -y \
|
||||
evince \
|
||||
# for checking
|
||||
xauth \
|
||||
#for gui
|
||||
ghostscript
|
||||
# for compression
|
||||
|
||||
RUN apt-get update && apt-get install -y ghostscript
|
||||
# for compression of pdfs
|
||||
|
||||
RUN useradd --create-home --shell /bin/bash --uid 1001 autonews
|
||||
# id mapped to local user
|
||||
# home directory needed for pip package installation
|
||||
RUN export PATH=/home/autonews/.local/bin:$PATH
|
||||
|
||||
|
||||
RUN mkdir -p /app/auto_news
|
||||
RUN chown -R autonews:autonews /app
|
||||
USER autonews
|
||||
RUN export PATH=/home/autonews/.local/bin:$PATH
|
||||
|
||||
COPY requirements.txt /app/requirements.txt
|
||||
RUN python3 -m pip install -r /app/requirements.txt
|
||||
|
||||
COPY app /app/auto_news
|
||||
COPY . /app/auto_news
|
||||
WORKDIR /app/auto_news
|
||||
|
@@ -1,59 +0,0 @@
from dataclasses import dataclass
import os
import shutil
import configparser
import logging
from datetime import datetime
from peewee import SqliteDatabase
from rich.logging import RichHandler

# first things first: logging
logging.basicConfig(
    format='%(message)s',
    level=logging.INFO,
    datefmt='%H:%M:%S', # add %Y-%m-%d if needed
    handlers=[RichHandler()]
)
logger = logging.getLogger(__name__)


# load config file containing constants and secrets
parsed = configparser.ConfigParser()
parsed.read("/app/containerdata/config/news_fetch.config.ini")

if os.getenv("DEBUG", "false") == "true":
    logger.warning("Found 'DEBUG=true', setting up dummy databases")

    db_base_path = parsed["DATABASE"]["db_path_dev"]
    parsed["SLACK"]["archive_id"] = parsed["SLACK"]["debug_id"]
    parsed["MAIL"]["recipient"] = parsed["MAIL"]["sender"]
    parsed["DOWNLOADS"]["local_storage_path"] = parsed["DATABASE"]["db_path_dev"]
else:
    logger.warning("Found 'DEBUG=false' and running on production databases, I hope you know what you're doing...")
    db_base_path = parsed["DATABASE"]["db_path_prod"]
    logger.info("Backing up databases")
    backup_dst = parsed["DATABASE"]["db_backup"]
    today = datetime.today().strftime("%Y.%m.%d")
    shutil.copyfile(
        os.path.join(db_base_path, parsed["DATABASE"]["chat_db_name"]),
        os.path.join(backup_dst, today + "." + parsed["DATABASE"]["chat_db_name"]),
    )
    shutil.copyfile(
        os.path.join(db_base_path, parsed["DATABASE"]["download_db_name"]),
        os.path.join(backup_dst, today + "." + parsed["DATABASE"]["download_db_name"]),
    )


from utils_storage import models

# Set up the database
models.set_db(
    SqliteDatabase(
        os.path.join(db_base_path, parsed["DATABASE"]["chat_db_name"]),
        pragmas = {'journal_mode': 'wal'} # multiple threads can read at once
    ),
    SqliteDatabase(
        os.path.join(db_base_path, parsed["DATABASE"]["download_db_name"]),
        pragmas = {'journal_mode': 'wal'} # multiple threads can read at once
    )
)
@ -1,205 +0,0 @@
|
||||
"""Main coordination of other util classes. Handles inbound and outbound calls"""
|
||||
import configuration
|
||||
models = configuration.models
|
||||
from threading import Thread
|
||||
import logging
|
||||
import os
|
||||
import sys
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
from utils_mail import runner as mail_runner
|
||||
from utils_slack import runner as slack_runner
|
||||
from utils_worker.workers import CompressWorker, DownloadWorker, FetchWorker, UploadWorker
|
||||
|
||||
|
||||
class ArticleWatcher:
|
||||
"""Wrapper for a newly created article object. Notifies the coordinator upon change/completition"""
|
||||
def __init__(self, article, thread, **kwargs) -> None:
|
||||
self.article_id = article.id # in case article becomes None at any point, we can still track the article
|
||||
self.article = article
|
||||
self.thread = thread
|
||||
|
||||
self.completition_notifier = kwargs.get("notifier")
|
||||
self.fetch = kwargs.get("worker_fetch", None)
|
||||
self.download = kwargs.get("worker_download", None)
|
||||
self.compress = kwargs.get("worker_compress", None)
|
||||
self.upload = kwargs.get("worker_upload", None)
|
||||
|
||||
self.completition_notified = False
|
||||
# self._download_called = self._compression_called = False
|
||||
self._fetch_completed = self._download_completed = self._compression_completed = self._upload_completed = False
|
||||
|
||||
# first step: gather metadata
|
||||
if self.fetch and self.upload:
|
||||
self.fetch.process(self) # this will call the update_status method
|
||||
self.upload.process(self) # idependent from the rest
|
||||
else: # the full kwargs were not provided, only do a manual run
|
||||
# overwrite update_status() because calls from the workers will result in erros
|
||||
self.update_status = lambda completed: logger.info(f"Completed action {completed}")
|
||||
for w in kwargs.get("workers_manual"):
|
||||
w.process(self)
|
||||
|
||||
|
||||
def update_status(self, completed_action):
|
||||
"""Checks and notifies internal completition-status.
|
||||
Article download is complete iff fetch and download were successfull and compression was run
|
||||
"""
|
||||
# if self.completition_notified and self._compression_completed and self._fetch_completed and self._download_completed and self._upload_completed, we are done
|
||||
if completed_action == "fetch":
|
||||
self.download.process(self)
|
||||
elif completed_action == "download":
|
||||
self.compress.process(self)
|
||||
elif completed_action == "compress": # last step
|
||||
self.completition_notifier(self.article, self.thread)
|
||||
# triggers action in Coordinator
|
||||
elif completed_action == "upload":
|
||||
# this case occurs when upload was faster than compression
|
||||
pass
|
||||
else:
|
||||
logger.warning(f"update_status called with unusual configuration: {completed_action}")
|
||||
|
||||
|
||||
# ====== Attributes to be modified by the util workers
|
||||
@property
|
||||
def fetch_completed(self):
|
||||
return self._fetch_completed
|
||||
|
||||
@fetch_completed.setter
|
||||
def fetch_completed(self, value: bool):
|
||||
self._fetch_completed = value
|
||||
self.update_status("fetch")
|
||||
|
||||
@property
|
||||
def download_completed(self):
|
||||
return self._download_completed
|
||||
|
||||
@download_completed.setter
|
||||
def download_completed(self, value: bool):
|
||||
self._download_completed = value
|
||||
self.update_status("download")
|
||||
|
||||
@property
|
||||
def compression_completed(self):
|
||||
return self._compression_completed
|
||||
|
||||
@compression_completed.setter
|
||||
def compression_completed(self, value: bool):
|
||||
self._compression_completed = value
|
||||
self.update_status("compress")
|
||||
|
||||
@property
|
||||
def upload_completed(self):
|
||||
return self._upload_completed
|
||||
|
||||
@upload_completed.setter
|
||||
def upload_completed(self, value: bool):
|
||||
self._upload_completed = value
|
||||
self.update_status("upload")
|
||||
|
||||
def __str__(self) -> str:
|
||||
return f"Article with id {self.article_id}"
|
||||
|
||||
|
||||
class Coordinator(Thread):
|
||||
def __init__(self, **kwargs) -> None:
|
||||
"""Launcher calls this Coordinator as the main thread to handle connections between the other workers (threaded)."""
|
||||
super().__init__(target = self.launch, daemon=True)
|
||||
|
||||
def add_workers(self, **kwargs):
|
||||
self.worker_slack = kwargs.pop("worker_slack", None)
|
||||
self.worker_mail = kwargs.pop("worker_mail", None)
|
||||
# the two above won't be needed in the Watcher
|
||||
self.worker_download = kwargs.get("worker_download", None)
|
||||
self.worker_fetch = kwargs.get("worker_fetch", None)
|
||||
self.worker_compress = kwargs.get("worker_compress", None)
|
||||
self.worker_upload = kwargs.get("worker_upload", None)
|
||||
|
||||
self.kwargs = kwargs
|
||||
|
||||
def launch(self) -> None:
|
||||
for w in [self.worker_download, self.worker_fetch, self.worker_upload, self.worker_compress]:
|
||||
if not w is None:
|
||||
w.start()
|
||||
|
||||
|
||||
def incoming_request(self, message):
|
||||
"""This method is passed onto the slack worker. It gets triggered when a new message is received."""
|
||||
url = message.urls[0] # ignore all the other ones
|
||||
article, is_new = models.ArticleDownload.get_or_create(article_url=url)
|
||||
thread = message.thread
|
||||
thread.article = article
|
||||
thread.save()
|
||||
self.kwargs.update({"notifier" : self.article_complete_notifier})
|
||||
|
||||
if is_new or (article.file_name == "" and article.verified == 0):
|
||||
# check for models that were created but were abandonned. This means they have missing information, most importantly no associated file
|
||||
# this overwrites previously set information, but that should not be too important
|
||||
ArticleWatcher(
|
||||
article,
|
||||
thread,
|
||||
**self.kwargs
|
||||
)
|
||||
|
||||
# All workers are implemented as a threaded queue. But the individual model requires a specific processing order:
|
||||
# fetch -> download -> compress -> complete
|
||||
# the watcher orchestrates the procedure and notifies upon completition
|
||||
# the watcher will notify once it is sufficiently populated
|
||||
else: # manually trigger notification immediatly
|
||||
logger.info(f"Found existing article {article}. Now sending")
|
||||
self.article_complete_notifier(article, thread)
|
||||
|
||||
|
||||
|
||||
def manual_processing(self, articles, workers):
|
||||
for w in workers:
|
||||
w.start()
|
||||
|
||||
for article in articles:
|
||||
notifier = lambda article: print(f"Completed manual actions for {article}")
|
||||
ArticleWatcher(article, None, workers_manual = workers, notifier = notifier) # Article watcher wants a thread to link article to TODO: handle threads as a kwarg
|
||||
|
||||
def article_complete_notifier(self, article, thread):
|
||||
if self.worker_slack is None:
|
||||
logger.warning("Not sending slack notifier")
|
||||
else:
|
||||
self.worker_slack.bot_worker.respond_channel_message(thread)
|
||||
if self.worker_mail is None:
|
||||
logger.warning("Not sending mail notifier")
|
||||
else:
|
||||
self.worker_mail.send(article)
|
||||
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
coordinator = Coordinator()
|
||||
|
||||
|
||||
if os.getenv("UPLOAD", "false") == "true":
|
||||
articles = models.ArticleDownload.select().where(models.ArticleDownload.archive_url == "").execute()
|
||||
logger.info(f"Launching upload to archive for {len(articles)} articles.")
|
||||
coordinator.manual_processing(articles, [UploadWorker()])
|
||||
|
||||
elif os.getenv("CHECK", "false") == "true":
|
||||
from utils_check import runner as check_runner
|
||||
check_runner.verify_unchecked()
|
||||
|
||||
else: # launch with full action
|
||||
slack_runner = slack_runner.BotRunner(coordinator.incoming_request)
|
||||
kwargs = {
|
||||
"worker_download" : DownloadWorker(),
|
||||
"worker_fetch" : FetchWorker(),
|
||||
"worker_upload" : UploadWorker(),
|
||||
"worker_compress" : CompressWorker(),
|
||||
"worker_slack" : slack_runner,
|
||||
"worker_mail" : mail_runner,
|
||||
}
|
||||
try:
|
||||
coordinator.add_workers(**kwargs)
|
||||
coordinator.start()
|
||||
slack_runner.start()
|
||||
except KeyboardInterrupt:
|
||||
logger.info("Keyboard interrupt. Stopping Slack and Coordinator")
|
||||
slack_runner.stop()
|
||||
print("BYE!")
|
||||
# coordinator was set as a daemon thread, so it will be stopped automatically
|
||||
sys.exit(0)
|
@ -1,285 +0,0 @@
|
||||
import logging
|
||||
import configuration
|
||||
import requests
|
||||
import os
|
||||
import time
|
||||
from threading import Thread
|
||||
from slack_sdk.errors import SlackApiError
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
config = configuration.parsed["SLACK"]
|
||||
models = configuration.models
|
||||
slack_client = "dummy"
|
||||
LATEST_RECORDED_REACTION = 0
|
||||
|
||||
|
||||
def init(client) -> None:
|
||||
"""Starts fetching past messages and returns the freshly launched thread"""
|
||||
global slack_client
|
||||
slack_client = client
|
||||
|
||||
global LATEST_RECORDED_REACTION
|
||||
try:
|
||||
LATEST_RECORDED_REACTION = models.Reaction.select(models.Reaction.id).order_by("id")[-1]
|
||||
except IndexError: #query is actually empty, we have never fetched any messages until now
|
||||
LATEST_RECORDED_REACTION = 0
|
||||
|
||||
# fetch all te messages we could have possibly missed
|
||||
logger.info("Querying missed messages, threads and reactions. This can take some time.")
|
||||
fetch_missed_channel_messages() # not threaded
|
||||
t = Thread(target = fetch_missed_channel_reactions, daemon=True) # threaded, runs in background (usually takes a long time)
|
||||
t.start()
|
||||
|
||||
if os.getenv("REDUCEDFETCH", "false") == "true":
|
||||
logger.warning("Only fetching empty threads for bot messages because 'REDUCEDFETCH=true'")
|
||||
fetch_missed_thread_messages(reduced=True)
|
||||
else: # perform both asyncronously
|
||||
fetch_missed_thread_messages()
|
||||
|
||||
|
||||
|
||||
def get_unhandled_messages():
|
||||
"""Gets all messages that have not yet been handled, be it by mistake or by downtime
|
||||
As the message handler makes no distinction between channel messages and thread messages,
|
||||
we don't have to worry about them here.
|
||||
"""
|
||||
|
||||
threaded_objects = []
|
||||
for t in models.Thread.select():
|
||||
if t.message_count > 1: # if only one message was written, it is the channel message
|
||||
msg = t.last_message
|
||||
if msg.is_by_human:
|
||||
threaded_objects.append(msg)
|
||||
# else don't, nothing to process
|
||||
logger.info(f"Set {len(threaded_objects)} thread-messages as not yet handled.")
|
||||
|
||||
|
||||
channel_objects = [t.initiator_message for t in models.Thread.select() if (t.message_count == 1 and not t.is_fully_processed)]
|
||||
logger.info(f"Set {len(channel_objects)} channel-messages as not yet handled.")
|
||||
|
||||
reaction_objects = list(models.Reaction.select().where(models.Reaction.id > LATEST_RECORDED_REACTION))
|
||||
logger.info(f"Set {len(reaction_objects)} reactions as not yet handled.")
|
||||
# the ones newer than the last before the fetch
|
||||
|
||||
all_messages = channel_objects + threaded_objects
|
||||
return all_messages, reaction_objects
|
||||
|
||||
|
||||
def fetch_missed_channel_messages():
|
||||
# latest processed message_ts is:
|
||||
presaved = models.Message.select().order_by(models.Message.ts)
|
||||
if not presaved:
|
||||
last_ts = 0
|
||||
else:
|
||||
last_message = presaved[-1]
|
||||
last_ts = last_message.slack_ts
|
||||
|
||||
result = slack_client.conversations_history(
|
||||
channel=config["archive_id"],
|
||||
oldest=last_ts
|
||||
)
|
||||
|
||||
new_messages = result.get("messages", [])
|
||||
# # filter the last one, it is a duplicate! (only if the db is not empty!)
|
||||
# if last_ts != 0 and len(new_messages) != 0:
|
||||
# new_messages.pop(-1)
|
||||
|
||||
new_fetches = 0
|
||||
for m in new_messages:
|
||||
# print(m)
|
||||
message_dict_to_model(m)
|
||||
new_fetches += 1
|
||||
|
||||
refetch = result.get("has_more", False)
|
||||
while refetch: # we have not actually fetched them all
|
||||
try:
|
||||
result = slack_client.conversations_history(
|
||||
channel = config["archive_id"],
|
||||
cursor = result["response_metadata"]["next_cursor"],
|
||||
oldest = last_ts
|
||||
) # fetches 100 messages, older than the [-1](=oldest) element of new_fetches
|
||||
refetch = result.get("has_more", False)
|
||||
|
||||
new_messages = result.get("messages", [])
|
||||
for m in new_messages:
|
||||
message_dict_to_model(m)
|
||||
new_fetches += 1
|
||||
except SlackApiError: # Most likely a rate-limit
|
||||
logger.error("Error while fetching channel messages. (likely rate limit) Retrying in {} seconds...".format(config["api_wait_time"]))
|
||||
time.sleep(config["api_wait_time"])
|
||||
refetch = True
|
||||
|
||||
logger.info(f"Fetched {new_fetches} new channel messages.")
|
||||
|
||||
|
||||
def fetch_missed_thread_messages(reduced=False):
|
||||
"""After having gotten all base-threads, we need to fetch all their replies"""
|
||||
# I don't know of a better way: we need to fetch this for each and every thread (except if it is marked as permanently solved)
|
||||
logger.info("Starting fetch of thread messages...")
|
||||
if reduced:
|
||||
threads = [t for t in models.Thread.select() if (t.message_count == 1 and not t.is_fully_processed)]
|
||||
# this only fetches completely empty threads, which might be because the bot-message was not yet saved to the db.
|
||||
# once we got all the bot-messages the remaining empty threads will be the ones we need to process.
|
||||
else:
|
||||
threads = [t for t in models.Thread.select() if not t.is_fully_processed]
|
||||
logger.info(f"Fetching history for {len(threads)} empty threads")
|
||||
new_messages = []
|
||||
for i,t in enumerate(threads):
|
||||
try:
|
||||
messages = slack_client.conversations_replies(
|
||||
channel = config["archive_id"],
|
||||
ts = t.slack_ts,
|
||||
oldest = t.messages[-1].slack_ts
|
||||
)["messages"]
|
||||
except SlackApiError:
|
||||
logger.error("Hit rate limit while querying threaded messages, retrying in {}s ({}/{} queries elapsed)".format(config["api_wait_time"], i, len(threads)))
|
||||
time.sleep(int(config["api_wait_time"]))
|
||||
messages = slack_client.conversations_replies(
|
||||
channel = config["archive_id"],
|
||||
ts = t.slack_ts,
|
||||
oldest = t.messages[-1].slack_ts
|
||||
)["messages"]
|
||||
|
||||
messages.pop(0) # the first message is the one posted in the channel. We already processed it!
|
||||
|
||||
for m in messages:
|
||||
# only append *new* messages
|
||||
res = message_dict_to_model(m)
|
||||
if res:
|
||||
new_messages.append(res)
|
||||
logger.info("Fetched {} new threaded messages.".format(len(new_messages)))
|
||||
|
||||
|
||||
def fetch_missed_channel_reactions():
|
||||
logger.info("Starting background fetch of channel reactions...")
|
||||
threads = [t for t in models.Thread.select() if not t.is_fully_processed]
|
||||
for i,t in enumerate(threads):
|
||||
reactions = []
|
||||
try:
|
||||
query = slack_client.reactions_get(
|
||||
channel = config["archive_id"],
|
||||
timestamp = t.slack_ts
|
||||
)
|
||||
reactions = query.get("message", []).get("reactions", []) # default = []
|
||||
except SlackApiError as e:
|
||||
if e.response.get("error", "") == "message_not_found":
|
||||
m = t.initiator_message
|
||||
logger.warning(f"Message (id={m.id}) not found. Skipping and saving...")
|
||||
# this usually means the message is past the 1000 message limit imposed by slack. Mark it as processed in the db
|
||||
m.is_processed_override = True
|
||||
m.save()
|
||||
else: # probably a rate_limit:
|
||||
logger.error("Hit rate limit while querying reactions. retrying in {}s ({}/{} queries elapsed)".format(config["api_wait_time"], i, len(threads)))
|
||||
time.sleep(int(config["api_wait_time"]))
|
||||
|
||||
for r in reactions:
|
||||
reaction_dict_to_model(r, t)
|
||||
|
||||
|
||||
|
||||
|
||||
# Helpers for message conversion to db-objects
|
||||
def reaction_dict_to_model(reaction, thread=None):
|
||||
if thread is None:
|
||||
m_ts = reaction["item"]["ts"]
|
||||
message = models.Message.get(ts = float(m_ts))
|
||||
thread = message.thread
|
||||
if "name" in reaction.keys(): # fetched through manual api query
|
||||
content = reaction["name"]
|
||||
elif "reaction" in reaction.keys(): # fetched through events
|
||||
content = reaction["reaction"]
|
||||
else:
|
||||
logger.error(f"Weird reaction received: {reaction}")
|
||||
return None
|
||||
|
||||
r, _ = models.Reaction.get_or_create(
|
||||
type = content,
|
||||
message = thread.initiator_message
|
||||
)
|
||||
logger.info("Saved reaction [{}]".format(content))
|
||||
return r
|
||||
|
||||
|
||||
def message_dict_to_model(message):
|
||||
if message["type"] == "message":
|
||||
thread_ts = message["thread_ts"] if "thread_ts" in message else message["ts"]
|
||||
uid = message.get("user", "BAD USER")
|
||||
if uid == "BAD USER":
|
||||
logger.critical("Message has no user?? {}".format(message))
|
||||
return None
|
||||
|
||||
user, _ = models.User.get_or_create(user_id = uid)
|
||||
thread, _ = models.Thread.get_or_create(thread_ts = thread_ts)
|
||||
m, new = models.Message.get_or_create(
|
||||
user = user,
|
||||
thread = thread,
|
||||
ts = message["ts"],
|
||||
channel_id = config["archive_id"],
|
||||
text = message["text"]
|
||||
)
|
||||
logger.info(f"Saved: {m} ({'new' if new else 'old'})")
|
||||
|
||||
files = message.get("files", [])
|
||||
if len(files) >= 1:
|
||||
f = files[0] #default: []
|
||||
m.file_type = f["filetype"]
|
||||
m.perma_link = f["url_private_download"]
|
||||
m.save()
|
||||
logger.info(f"Saved {m.file_type}-file for message (id={m.id})")
|
||||
if new:
|
||||
return m
|
||||
else:
|
||||
return None
|
||||
else:
|
||||
logger.warning("What should I do of {}".format(message))
|
||||
return None
|
||||
|
||||
|
||||
def say_substitute(*args, **kwargs):
|
||||
logger.info("Now sending message through say-substitute: {}".format(" - ".join(args)))
|
||||
slack_client.chat_postMessage(
|
||||
channel=config["archive_id"],
|
||||
text=" - ".join(args),
|
||||
**kwargs
|
||||
)
|
||||
|
||||
|
||||
def save_as_related_file(url, article_object):
|
||||
r = requests.get(url, headers={"Authorization": "Bearer {}".format(slack_client.token)})
|
||||
saveto = article_object.save_path
|
||||
ftype = url[url.rfind(".") + 1:]
|
||||
fname = "{} - related no {}.{}".format(
|
||||
article_object.file_name.replace(".pdf",""),
|
||||
len(article_object.related) + 1,
|
||||
ftype
|
||||
)
|
||||
with open(os.path.join(saveto, fname), "wb") as f:
|
||||
f.write(r.content)
|
||||
article_object.set_related([fname])
|
||||
logger.info("Added {} to model {}".format(fname, article_object))
|
||||
return fname
|
||||
|
||||
|
||||
def react_file_path_message(fname, article_object):
|
||||
saveto = article_object.save_path
|
||||
file_path = os.path.join(saveto, fname)
|
||||
if os.path.exists(file_path):
|
||||
article_object.set_related([fname])
|
||||
logger.info("Added {} to model {}".format(fname, article_object))
|
||||
return True
|
||||
else:
|
||||
return False
|
||||
|
||||
|
||||
def is_message_in_archiving(message) -> bool:
|
||||
if isinstance(message, dict):
|
||||
return message["channel"] == config["archive_id"]
|
||||
else:
|
||||
return message.channel_id == config["archive_id"]
|
||||
|
||||
|
||||
def is_reaction_in_archiving(event) -> bool:
|
||||
if isinstance(event, dict):
|
||||
return event["item"]["channel"] == config["archive_id"]
|
||||
else:
|
||||
return event.message.channel_id == config["archive_id"]
|
@ -1,189 +0,0 @@
|
||||
from slack_bolt import App
|
||||
from slack_bolt.adapter.socket_mode import SocketModeHandler
|
||||
from slack_sdk.errors import SlackApiError
|
||||
|
||||
import logging
|
||||
import configuration
|
||||
|
||||
from . import message_helpers
|
||||
|
||||
|
||||
config = configuration.parsed["SLACK"]
|
||||
models = configuration.models
|
||||
|
||||
class BotApp(App):
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
def __init__(self, callback, *args, **kwargs):
|
||||
|
||||
super().__init__(*args, **kwargs)
|
||||
self.callback = callback
|
||||
|
||||
def pre_start(self):
|
||||
message_helpers.init(self.client)
|
||||
missed_messages, missed_reactions = message_helpers.get_unhandled_messages()
|
||||
|
||||
[self.handle_incoming_message(m) for m in missed_messages]
|
||||
[self.handle_incoming_reaction(r) for r in missed_reactions]
|
||||
|
||||
# self.react_missed_reactions(missed_reactions)
|
||||
# self.react_missed_messages(missed_messages)
|
||||
self.startup_status()
|
||||
|
||||
|
||||
|
||||
def handle_incoming_reaction(self, reaction):
|
||||
if isinstance(reaction, dict): #else: the reaction is already being passed as a model
|
||||
# CAUTION: filter for 'changed reactions' those are nasty (usually when adding an url)
|
||||
reaction = message_helpers.reaction_dict_to_model(reaction)
|
||||
|
||||
thread = reaction.message.thread
|
||||
article_object = thread.article
|
||||
if not article_object is None:
|
||||
reaction = reaction.type
|
||||
status = 1 if reaction == "white_check_mark" else -1
|
||||
|
||||
# self.logger.info(f"Applying reaction {reaction} to its root message.")
|
||||
article_object.verified = status
|
||||
article_object.save()
|
||||
|
||||
|
||||
def handle_incoming_message(self, message):
|
||||
"""Reacts to all messages inside channel archiving. Must then
|
||||
distinguish between threaded replies and new requests
|
||||
and react accordingly"""
|
||||
if isinstance(message, dict): #else: the message is already being passed as a model
|
||||
# CAUTION: filter for 'changed messages' those are nasty (usually when adding an url)
|
||||
if message.get("subtype", "not bad") == "message_changed":
|
||||
return False
|
||||
message = message_helpers.message_dict_to_model(message)
|
||||
|
||||
# First check: belongs to thread?
|
||||
is_threaded = message.thread.message_count > 1 and message != message.thread.initiator_message
|
||||
if is_threaded:
|
||||
self.incoming_thread_message(message)
|
||||
else:
|
||||
self.incoming_channel_message(message)
|
||||
|
||||
|
||||
def incoming_thread_message(self, message):
|
||||
if message.user.user_id == config["bot_id"]:
|
||||
return True # ignore the files uploaded by the bot. We handled them already!
|
||||
|
||||
thread = message.thread
|
||||
if thread.is_fully_processed:
|
||||
return True
|
||||
|
||||
self.logger.info("Receiving thread-message")
|
||||
self.respond_thread_message(message)
|
||||
|
||||
|
||||
def incoming_channel_message(self, message):
|
||||
self.logger.info(f"Handling message {message} ({len(message.urls)} urls)")
|
||||
|
||||
if not message.urls: # no urls in a root-message => IGNORE
|
||||
message.is_processed_override = True
|
||||
message.save()
|
||||
return
|
||||
|
||||
# ensure thread is still empty, this is a scenario encountered only in testing, but let's just filter it
|
||||
if message.thread.message_count > 1:
|
||||
self.logger.info("Discarded message because it is actually processed.")
|
||||
return
|
||||
|
||||
if len(message.urls) > 1:
|
||||
message_helpers.say_substitute("Only the first url is being handled. Please send any subsequent url as a separate message", thread_ts=message.thread.slack_ts)
|
||||
|
||||
self.callback(message)
|
||||
# for url in message.urls:
|
||||
# self.callback(url, message)
|
||||
# stop here!
|
||||
|
||||
|
||||
|
||||
def respond_thread_message(self, message, say=message_helpers.say_substitute):
|
||||
thread = message.thread
|
||||
article = thread.article
|
||||
if message.perma_link: # file upload means new data
|
||||
fname = message_helpers.save_as_related_file(message.perma_link, article)
|
||||
say("File was saved as 'related file' under `{}`.".format(fname),
|
||||
thread_ts=thread.slack_ts
|
||||
)
|
||||
else: # either a pointer to a new file (too large to upload), or trash
|
||||
success = message_helpers.react_file_path_message(message.text, article)
|
||||
if success:
|
||||
say("File was saved as 'related file'", thread_ts=thread.slack_ts)
|
||||
else:
|
||||
self.logger.error("User replied to thread {} but the response did not contain a file/path".format(thread))
|
||||
say("Cannot process response without associated file.",
|
||||
thread_ts=thread.slack_ts
|
||||
)
|
||||
|
||||
|
||||
def respond_channel_message(self, thread, say=message_helpers.say_substitute):
|
||||
article = thread.article
|
||||
answers = article.slack_info
|
||||
for a in answers:
|
||||
if a["file_path"]:
|
||||
try: # upload resulted in an error
|
||||
self.client.files_upload(
|
||||
channels = config["archive_id"],
|
||||
initial_comment = f"<@{config['responsible_id']}> \n {a['reply_text']}",
|
||||
file = a["file_path"],
|
||||
thread_ts = thread.slack_ts
|
||||
)
|
||||
status = True
|
||||
except SlackApiError as e:
|
||||
say(
|
||||
"File {} could not be uploaded.".format(a),
|
||||
thread_ts=thread.slack_ts
|
||||
)
|
||||
status = False
|
||||
self.logger.error(f"File upload failed: {e}")
|
||||
else: # anticipated that there is no file!
|
||||
say(
|
||||
f"<@{config['responsible_id']}> \n {a['reply_text']}",
|
||||
thread_ts=thread.slack_ts
|
||||
)
|
||||
status = True
|
||||
|
||||
|
||||
def startup_status(self):
|
||||
threads = [t for t in models.Thread.select()]
|
||||
all_threads = len(threads)
|
||||
fully_processed = len([t for t in threads if t.is_fully_processed])
|
||||
fully_unprocessed = len([t for t in threads if t.message_count == 1])
|
||||
articles_unprocessed = len(models.ArticleDownload.select().where(models.ArticleDownload.verified < 1))
|
||||
self.logger.info(f"[bold]STATUS[/bold]: Fully processed {fully_processed}/{all_threads} threads. {fully_unprocessed} threads have 0 replies. Article-objects to verify: {articles_unprocessed}", extra={"markup": True})
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
class BotRunner():
|
||||
"""Stupid encapsulation so that we can apply the slack decorators to the BotApp"""
|
||||
def __init__(self, callback, *args, **kwargs) -> None:
|
||||
self.bot_worker = BotApp(callback, token=config["auth_token"])
|
||||
|
||||
@self.bot_worker.event(event="message", matchers=[message_helpers.is_message_in_archiving])
|
||||
def handle_incoming_message(message, say):
|
||||
return self.bot_worker.handle_incoming_message(message)
|
||||
|
||||
@self.bot_worker.event(event="reaction_added", matchers=[message_helpers.is_reaction_in_archiving])
|
||||
def handle_incoming_reaction(event, say):
|
||||
return self.bot_worker.handle_incoming_reaction(event)
|
||||
|
||||
self.handler = SocketModeHandler(self.bot_worker, config["app_token"])
|
||||
|
||||
|
||||
def start(self):
|
||||
self.bot_worker.pre_start()
|
||||
self.handler.start()
|
||||
|
||||
|
||||
def stop(self):
|
||||
self.handler.close()
|
||||
print("Bye handler!")
|
||||
|
||||
# def respond_to_message(self, message):
|
||||
# self.bot_worker.handle_incoming_message(message)
|
@ -1,331 +0,0 @@
|
||||
import logging
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
from peewee import *
|
||||
import os
|
||||
import markdown
|
||||
import re
|
||||
import configuration
|
||||
import datetime
|
||||
|
||||
config = configuration.parsed["DOWNLOADS"]
|
||||
slack_config = configuration.parsed["SLACK"]
|
||||
|
||||
## Helpers
|
||||
chat_db = DatabaseProxy()
|
||||
download_db = DatabaseProxy()
|
||||
|
||||
# set the nature of the db at runtime
|
||||
|
||||
class DownloadBaseModel(Model):
|
||||
class Meta:
|
||||
database = download_db
|
||||
|
||||
class ChatBaseModel(Model):
|
||||
class Meta:
|
||||
database = chat_db
|
||||
|
||||
|
||||
|
||||
## == Article related models == ##
|
||||
class ArticleDownload(DownloadBaseModel):
|
||||
title = CharField(default='')
|
||||
pub_date = DateField(default = '')
|
||||
download_date = DateField(default = datetime.date.today)
|
||||
source_name = CharField(default = '')
|
||||
article_url = TextField(default = '', unique=True)
|
||||
archive_url = TextField(default = '')
|
||||
file_name = TextField(default = '')
|
||||
language = CharField(default = '')
|
||||
summary = TextField(default = '')
|
||||
comment = TextField(default = '')
|
||||
verified = IntegerField(default = False)
|
||||
# authors
|
||||
# keywords
|
||||
# ... are added through foreignkeys
|
||||
|
||||
def __str__(self) -> str:
|
||||
if self.title != '' and self.source_name != '':
|
||||
desc = f"{shorten_name(self.title)} -- {self.source_name}"
|
||||
else:
|
||||
desc = f"{self.article_url}"
|
||||
return f"ART [{desc}]"
|
||||
|
||||
## Useful Properties
|
||||
@property
|
||||
def save_path(self):
|
||||
return f"{config['local_storage_path']}/{self.download_date.year}/{self.download_date.strftime('%B')}/"
|
||||
|
||||
def fname_nas(self, file_name=""):
|
||||
if self.download_date:
|
||||
if file_name:
|
||||
return "NAS: {}/{}/{}/{}".format(config["remote_storage_path"], self.download_date.year, self.download_date.strftime("%B"), file_name)
|
||||
else: # return the self. name
|
||||
return "NAS: {}/{}/{}/{}".format(config["remote_storage_path"], self.download_date.year, self.download_date.strftime("%B"), self.file_name)
|
||||
else:
|
||||
return None
|
||||
|
||||
@property
|
||||
def fname_template(self):
|
||||
if "youtube.com" in self.source_name or "youtu.be" in self.source_name:
|
||||
fname = "{} -- {}".format(self.source_name, self.title)
|
||||
else:
|
||||
fname = "{} -- {}.pdf".format(self.source_name, self.title)
|
||||
return clear_path_name(fname)
|
||||
|
||||
@property
|
||||
def is_title_bad(self): # add incrementally
|
||||
return "PUR-Abo" in self.title \
|
||||
or "Redirecting" in self.title \
|
||||
or "Error while running fetch" in self.title
|
||||
|
||||
@property
|
||||
def slack_info(self):
|
||||
status = [":x: No better version available", ":gear: Verification pending", ":white_check_mark: Verified by human"][self.verified + 1]
|
||||
content = "\n>" + "\n>".join(self.summary.split("\n"))
|
||||
file_status, msg = self.file_status()
|
||||
if not file_status:
|
||||
return [msg]
|
||||
|
||||
# everything alright: generate real content
|
||||
# first the base file
|
||||
if self.file_name[-4:] == ".pdf":
|
||||
answer = [{ # main reply with the base pdf
|
||||
"reply_text" : f"*{self.title}*\n{status}\n{content}",
|
||||
"file_path" : self.save_path + self.file_name
|
||||
}]
|
||||
else: # don't upload if the file is too big!
|
||||
location = "Not uploaded to slack, but the file will be on the NAS:\n`{}`".format(self.fname_nas())
|
||||
answer = [{ # main reply with the base pdf
|
||||
"reply_text" : "*{}*\n{}\n{}\n{}".format(self.title, status, content, location),
|
||||
"file_path" : None
|
||||
}]
|
||||
|
||||
# then the related files
|
||||
rel_text = ""
|
||||
for r in self.related:
|
||||
fname = r.related_file_name
|
||||
lentry = "\n• `{}` ".format(self.fname_nas(fname))
|
||||
if fname[-4:] == ".pdf": # this is a manageable file, directly upload
|
||||
f_ret = self.save_path + fname
|
||||
answer.append({"reply_text":"", "file_path" : f_ret})
|
||||
else: # not pdf <=> too large. Don't upload but mention its existence
|
||||
lentry += "(not uploaded to slack, but the file will be on the NAS)"
|
||||
|
||||
rel_text += lentry
|
||||
|
||||
if rel_text:
|
||||
rel_text = answer[0]["reply_text"] = answer[0]["reply_text"] + "\nRelated files:\n" + rel_text
|
||||
|
||||
return answer
|
||||
|
||||
@property
|
||||
def mail_info(self):
|
||||
base = [{"reply_text": "[{}]({})\n".format(self.article_url, self.article_url), "file_path":None}] + self.slack_info
|
||||
return [{"reply_text": markdown.markdown(m["reply_text"]), "file_path": m["file_path"]} for m in base]
|
||||
|
||||
|
||||
## Helpers
|
||||
def set_keywords(self, keywords):
|
||||
for k in keywords:
|
||||
ArticleKeyword.create(
|
||||
article = self,
|
||||
keyword = k
|
||||
)
|
||||
|
||||
def set_authors(self, authors):
|
||||
for a in authors:
|
||||
ArticleAuthor.create(
|
||||
article = self,
|
||||
author = a
|
||||
)
|
||||
|
||||
def set_references(self, references):
|
||||
for r in references:
|
||||
ArticleReference.create(
|
||||
article = self,
|
||||
reference_url = r
|
||||
)
|
||||
|
||||
def set_related(self, related):
|
||||
for r in related:
|
||||
ArticleRelated.create(
|
||||
article = self,
|
||||
related_file_name = r
|
||||
)
|
||||
|
||||
def file_status(self):
|
||||
if not self.file_name:
|
||||
logger.error("Article {} has no filename!".format(self))
|
||||
return False, {"reply_text": "Download failed, no file was saved.", "file_path": None}
|
||||
|
||||
file_path_abs = self.save_path + self.file_name
|
||||
if not os.path.exists(file_path_abs):
|
||||
logger.error("Article {} has a filename, but the file does not exist at that location!".format(self))
|
||||
return False, {"reply_text": "Can't find file. Either the download failed or the file was moved.", "file_path": None}
|
||||
|
||||
return True, {}
|
||||
|
||||
|
||||
class ArticleKeyword(DownloadBaseModel):
|
||||
# instance gets created for every one keyword -> flexible in size
|
||||
article = ForeignKeyField(ArticleDownload, backref='keywords')
|
||||
keyword = CharField()
|
||||
|
||||
|
||||
class ArticleAuthor(DownloadBaseModel):
|
||||
article = ForeignKeyField(ArticleDownload, backref='authors')
|
||||
author = CharField()
|
||||
|
||||
|
||||
class ArticleReference(DownloadBaseModel):
|
||||
article = ForeignKeyField(ArticleDownload, backref='references')
|
||||
reference_url = TextField(default = '')
|
||||
|
||||
|
||||
class ArticleRelated(DownloadBaseModel):
|
||||
article = ForeignKeyField(ArticleDownload, backref='related')
|
||||
related_file_name = TextField(default = '')
|
||||
|
||||
|
||||
|
||||
|
||||
## == Slack-thread related models == ##
|
||||
class User(ChatBaseModel):
|
||||
user_id = CharField(default='', unique=True)
|
||||
# messages
|
||||
|
||||
|
||||
class Thread(ChatBaseModel):
|
||||
"""The threads that concern us are only created if the base massage contains a url"""
|
||||
thread_ts = FloatField(default = 0)
|
||||
article = ForeignKeyField(ArticleDownload, backref="slack_thread", null=True, default=None)
|
||||
# provides, ts, user, models
|
||||
# messages
|
||||
|
||||
@property
|
||||
def slack_ts(self):
|
||||
str_ts = str(self.thread_ts)
|
||||
cut_zeros = 6 - (len(str_ts) - str_ts.find(".") - 1) # usually there a 6 decimals. If there are less, problem!
|
||||
return "{}{}".format(str_ts, cut_zeros*"0")
|
||||
|
||||
@property
|
||||
def initiator_message(self):
|
||||
try:
|
||||
return self.messages[0] # TODO check if this needs sorting
|
||||
except IndexError:
|
||||
logger.warning(f"Thread {self} is empty. How can that be?")
|
||||
return None
|
||||
|
||||
@property
|
||||
def message_count(self):
|
||||
# logger.warning("message_count was called")
|
||||
return self.messages.count()
|
||||
|
||||
@property
|
||||
def last_message(self):
|
||||
messages = Message.select().where(Message.thread == self).order_by(Message.ts) # can't be empty by definition/creation
|
||||
return messages[-1]
|
||||
|
||||
@property
|
||||
def is_fully_processed(self) -> bool:
|
||||
init_message = self.initiator_message
|
||||
if init_message is None:
|
||||
return False
|
||||
|
||||
if init_message.is_processed_override:
|
||||
return True
|
||||
# this override is set for instance, when no url was sent at all. Then set this thread to be ignored
|
||||
|
||||
reactions = init_message.reaction
|
||||
if not reactions:
|
||||
return False
|
||||
else:
|
||||
r = reactions[0].type # can and should only have one reaction
|
||||
return r == "white_check_mark" \
|
||||
or r == "x"
|
||||
|
||||
|
||||
|
||||
class Message(ChatBaseModel):
|
||||
ts = FloatField(unique=True) #for sorting
|
||||
channel_id = CharField(default='')
|
||||
user = ForeignKeyField(User, backref="messages")
|
||||
text = TextField(default='')
|
||||
thread = ForeignKeyField(Thread, backref="messages", default=None)
|
||||
file_type = CharField(default='')
|
||||
perma_link = CharField(default='')
|
||||
is_processed_override = BooleanField(default=False)
|
||||
# reaction
|
||||
|
||||
def __str__(self) -> str:
|
||||
return "MSG [{}]".format(shorten_name(self.text).replace('\n','/'))
|
||||
|
||||
@property
|
||||
def slack_ts(self):
|
||||
str_ts = str(self.ts)
|
||||
cut_zeros = 6 - (len(str_ts) - str_ts.find(".") - 1) # usually there a 6 decimals. If there are less, problem!
|
||||
return "{}{}".format(str_ts, cut_zeros * "0")
|
||||
|
||||
|
||||
@property
|
||||
def urls(self):
|
||||
pattern = r"<(.*?)>"
|
||||
matches = re.findall(pattern, self.text)
|
||||
matches = [m for m in matches if "." in m]
|
||||
|
||||
new_matches = []
|
||||
for m in matches:
|
||||
if "." in m: # must contain a tld, right?
|
||||
# further complication: slack automatically abreviates urls in the format:
|
||||
# <url|link preview>. Lucky for us, "|" is a character derecommended in urls, meaning we can "safely" split for it and retain the first half
|
||||
if "|" in m:
|
||||
keep = m.split("|")[0]
|
||||
else:
|
||||
keep = m
|
||||
new_matches.append(keep)
|
||||
return new_matches
|
||||
|
||||
@property
|
||||
def is_by_human(self):
|
||||
return self.user.user_id != slack_config["bot_id"]
|
||||
|
||||
|
||||
@property
|
||||
def has_single_url(self):
|
||||
return len(self.urls) == 1
|
||||
|
||||
|
||||
class Reaction(ChatBaseModel):
|
||||
type = CharField(default = "")
|
||||
message = ForeignKeyField(Message, backref="reaction")
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
def create_tables():
|
||||
with download_db:
|
||||
download_db.create_tables([ArticleDownload, ArticleKeyword, ArticleAuthor, ArticleReference, ArticleRelated])
|
||||
with chat_db:
|
||||
chat_db.create_tables([User, Message, Thread, Reaction])
|
||||
|
||||
|
||||
def set_db(chat_db_object, download_db_object):
|
||||
chat_db.initialize(chat_db_object)
|
||||
download_db.initialize(download_db_object)
|
||||
create_tables()
|
||||
|
||||
def clear_path_name(path):
|
||||
keepcharacters = (' ','.','_', '-')
|
||||
converted = "".join([c if (c.isalnum() or c in keepcharacters) else "_" for c in path]).rstrip()
|
||||
return converted
|
||||
|
||||
def shorten_name(name, offset = 50):
|
||||
if len(name) > offset:
|
||||
return name[:offset] + "..."
|
||||
else:
|
||||
return name
|
news_fetch/configuration.py (Normal file, 68 lines)
@@ -0,0 +1,68 @@
import os
import configparser
import logging
import time
import shutil
from datetime import datetime
from peewee import SqliteDatabase, PostgresqlDatabase
from rich.logging import RichHandler


# first things first: logging
logging.basicConfig(
    format='%(message)s',
    level=logging.INFO,
    datefmt='%H:%M:%S', # add %Y-%m-%d if needed
    handlers=[RichHandler()]
)
logger = logging.getLogger(__name__)


# load config file containing constants and secrets
main_config = configparser.ConfigParser()
main_config.read("/app/containerdata/config/news_fetch.config.ini")
db_config = configparser.ConfigParser()
db_config.read("/app/containerdata/config/db.config.ini")


# DEBUG MODE:
if os.getenv("DEBUG", "false") == "true":
    logger.warning("Found 'DEBUG=true', setting up dummy databases")

    main_config["SLACK"]["archive_id"] = main_config["SLACK"]["debug_id"]
    main_config["MAIL"]["recipient"] = main_config["MAIL"]["sender"]
    main_config["DOWNLOADS"]["local_storage_path"] = main_config["DOWNLOADS"]["debug_storage_path"]

    download_db = SqliteDatabase(
        main_config["DATABASE"]["download_db_debug"],
        pragmas = {'journal_mode': 'wal'} # multiple threads can read at once
    )

# PRODUCTION MODE:
else:
    logger.warning("Found 'DEBUG=false' and running on production databases, I hope you know what you're doing...")

    time.sleep(10) # wait for the vpn to connect (can't use a healthcheck because there is no depends_on)
    cred = db_config["DATABASE"]
    download_db = PostgresqlDatabase(
        cred["db_name"], user=cred["user_name"], password=cred["password"], host="vpn", port=5432
    )
    # TODO Reimplement backup/printout
    # logger.info("Backing up databases")
    # backup_dst = main_config["DATABASE"]["db_backup"]
    # today = datetime.today().strftime("%Y.%m.%d")
    # shutil.copyfile(
    #     os.path.join(db_base_path, main_config["DATABASE"]["chat_db_name"]),
    #     os.path.join(backup_dst, today + "." + main_config["DATABASE"]["chat_db_name"]),
    # )
    # shutil.copyfile(
    #     os.path.join(db_base_path, main_config["DATABASE"]["download_db_name"]),
    #     os.path.join(backup_dst, today + "." + main_config["DATABASE"]["download_db_name"]),
    # )



from utils_storage import models

# Set up the database
models.set_db(download_db)
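Downstream modules (as seen in utils_mail/runner.py and utils_slack/runner.py below) import this module and read `main_config` sections plus the already-bound `models`. A small sketch of that consumption pattern, assuming it runs inside the container where the config files exist:

```python
# Sketch of how other modules consume configuration.py (mirrors the diff below).
import configuration

slack_config = configuration.main_config["SLACK"]  # a section of news_fetch.config.ini
models = configuration.models                      # peewee models, already bound to download_db

pending = models.ArticleDownload.select().where(models.ArticleDownload.verified == 0).count()
print(f"{pending} articles awaiting verification in channel {slack_config['archive_id']}")
```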
@@ -8,3 +8,4 @@ newspaper3k
htmldate
markdown
rich
psycopg2
news_fetch/runner.py (Normal file, 183 lines)
@@ -0,0 +1,183 @@
"""Main coordination of other util classes. Handles inbound and outbound calls"""
import configuration
models = configuration.models
from threading import Thread
import logging
logger = logging.getLogger(__name__)
import sys
from collections import OrderedDict


from utils_mail import runner as MailRunner
from utils_slack import runner as SlackRunner
from utils_worker.workers import CompressWorker, DownloadWorker, FetchWorker, UploadWorker


class ArticleWatcher:
    """Wrapper for a newly created article object. Notifies the coordinator upon change/completion"""
    def __init__(self, article, workers_in, workers_out) -> None:
        self.article = article

        self.workers_in = workers_in
        self.workers_out = workers_out

        self.completition_notified = False

        for w_dict in self.workers_in:
            worker = self.get_next_worker(w_dict) # gets the first worker of each dict (they get processed independently)
            worker.process(self)


    def get_next_worker(self, worker_dict, worker_name=""):
        """Returns the worker coming after the one with key worker_name"""

        if worker_name == "": # first one
            return worker_dict[list(worker_dict.keys())[0]]
        # for i,w_dict in enumerate(workers_list):
        keys = list(worker_dict.keys())
        next_key_ind = keys.index(worker_name) + 1
        try:
            key = keys[next_key_ind]
            return worker_dict[key]
        except IndexError:
            return None


    def update(self, worker_name):
        """Called by the workers to notify the watcher of a completed step"""
        for w_dict in self.workers_in:
            if worker_name in w_dict.keys():
                next_worker = self.get_next_worker(w_dict, worker_name)
                if next_worker:
                    if next_worker == "out":
                        self.completion_notifier()
                    else: # it's just another in-worker
                        next_worker.process(self)
                else: # no next worker, we are done
                    logger.info(f"No worker after {worker_name}")


    def completion_notifier(self):
        """Triggers the out-workers to process the article, that is to send out a message"""
        for w_dict in self.workers_out:
            worker = self.get_next_worker(w_dict)
            worker.send(self.article)
        self.article.sent = True
        self.article.save()


    def __str__(self) -> str:
        return f"ArticleWatcher with id {self.article.id}"



class Dispatcher(Thread):
    def __init__(self) -> None:
        """Thread to handle incoming requests and control the workers"""
        self.workers_in = []
        self.workers_out = []
        super().__init__(target = self.launch)


    def launch(self) -> None:
        # start workers (each worker is a thread)
        for w_dict in self.workers_in: # for reduced operations such as upload, some workers are not set
            for w in w_dict.values():
                if isinstance(w, Thread):
                    w.start()

        # get all articles not fully processed
        unsent = models.ArticleDownload.filter(sent = False) # if past messages have not been sent, they must be reevaluated
        for a in unsent:
            self.incoming_request(article=a)


    def incoming_request(self, message=None, article=None):
        """This method is passed onto the slack worker. It then is called when a new message is received."""

        if message is not None:
            try:
                url = message.urls[0] # ignore all the other ones
            except IndexError:
                return
            article, is_new = models.ArticleDownload.get_or_create(article_url=url)
            article.slack_ts = message.ts # either update the timestamp (to the last reference to the article) or set it for the first time
            article.save()
        elif article is not None:
            is_new = False
            logger.info(f"Received article {article} in incoming_request")
        else:
            logger.error("Dispatcher.incoming_request called with no arguments")
            return

        if is_new or (article.file_name == "" and article.verified == 0):
            # check for models that were created but were abandoned. This means they have missing information, most importantly no associated file
            # this overwrites previously set information, but that should not be too important
            ArticleWatcher(
                article,
                workers_in=self.workers_in,
                workers_out=self.workers_out,
            )

        else: # manually trigger notification immediately
            logger.info(f"Found existing article {article}. Now sending")
            self.article_complete_notifier(article)



    # def manual_processing(self, articles, workers):
    #     for w in workers:
    #         w.start()

    #     for article in articles:
    #         notifier = lambda article: logger.info(f"Completed manual actions for {article}")
    #         ArticleWatcher(article, workers_manual = workers, notifier = notifier) # Article watcher wants a thread to link article to TODO: handle threads as a kwarg



if __name__ == "__main__":
    dispatcher = Dispatcher()

    if "upload" in sys.argv:
        class PrintWorker:
            def send(self, article):
                print(f"Uploaded article {article}")

        articles = models.ArticleDownload.select().where(
            (models.ArticleDownload.archive_url == "") | (models.ArticleDownload.archive_url == "TODO:UPLOAD")
        ).execute()
        logger.info(f"Launching upload to archive for {len(articles)} articles.")

        dispatcher.workers_in = [{"UploadWorker": UploadWorker()}]
        dispatcher.workers_out = [{"PrintWorker": PrintWorker()}]
        dispatcher.start()

    else: # launch with full action
        try:
            slack_runner = SlackRunner.BotRunner(dispatcher.incoming_request)
            # All workers are implemented as a threaded queue. But the individual model requires a specific processing order:
            # fetch -> download -> compress -> complete
            # This is reflected in the following list of workers:
            workers_in = [
                OrderedDict({"FetchWorker": FetchWorker(), "DownloadWorker": DownloadWorker(), "CompressWorker": CompressWorker(), "NotifyRunner": "out"}),
                OrderedDict({"UploadWorker": UploadWorker()})
            ]
            # The two dicts are processed independently. The first element of the first dict is called at the same time as the first element of the second dict.
            # Inside a dict, the order of the keys gives the order of execution (only when the first element is done, the second is called, etc.)

            workers_out = [{"SlackRunner": slack_runner}, {"MailRunner": MailRunner}]

            dispatcher.workers_in = workers_in
            dispatcher.workers_out = workers_out

            dispatcher.start() # starts the thread, (ie. runs launch())
            slack_runner.start() # last one to start, inside the main thread
        except KeyboardInterrupt:
            logger.info("Keyboard interrupt. Stopping Slack and dispatcher")
            slack_runner.stop()
            dispatcher.join()
            for w_dict in workers_in:
                for w in w_dict.values():
                    if isinstance(w, Thread):
                        w.stop()

            # All threads are launched as daemon threads, meaning that any 'leftover' should exit along with the sys call
            sys.exit(0)
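The OrderedDict pipeline above relies on a simple worker contract: each worker receives the watcher via `process()` and, when finished, calls `watcher.update()` with the key it was registered under, so the next entry in the same dict runs. A minimal sketch under that assumption (`DummyFetch`/`DummyDownload` are made-up names, not workers from the repo):

```python
# Sketch of the worker contract assumed by ArticleWatcher/Dispatcher.
from collections import OrderedDict

class DummyFetch:
    def process(self, watcher):
        print(f"fetching metadata for {watcher.article}")
        watcher.update("DummyFetch")      # hand over to the next key in this dict

class DummyDownload:
    def process(self, watcher):
        print(f"downloading {watcher.article}")
        watcher.update("DummyDownload")   # the next key here would be "out", triggering workers_out

# pipeline order = key order; the sentinel value "out" marks the hand-off to the notifiers
pipeline = [OrderedDict({"DummyFetch": DummyFetch(), "DummyDownload": DummyDownload(), "NotifyRunner": "out"})]
```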
@@ -23,7 +23,7 @@ u_options = {

bot_client = WebClient(
    token = configuration.parsed["SLACK"]["auth_token"]
    token = configuration.main_config["SLACK"]["auth_token"]
)

@@ -70,7 +70,7 @@ def send_reaction_to_slack_thread(article, reaction):
    else:
        ts = m.slack_ts
    bot_client.reactions_add(
        channel=configuration.parsed["SLACK"]["archive_id"],
        channel=configuration.main_config["SLACK"]["archive_id"],
        name=reaction,
        timestamp=ts
    )
@@ -7,7 +7,7 @@ import logging
import configuration

logger = logging.getLogger(__name__)
config = configuration.parsed["MAIL"]
config = configuration.main_config["MAIL"]

def send(article_model):
    mail = MIMEMultipart()
news_fetch/utils_slack/runner.py (Normal file, 246 lines)
@@ -0,0 +1,246 @@
|
||||
from slack_bolt import App
|
||||
from slack_bolt.adapter.socket_mode import SocketModeHandler
|
||||
from slack_sdk.errors import SlackApiError
|
||||
|
||||
import logging
|
||||
import re
|
||||
import time
|
||||
|
||||
import configuration
|
||||
config = configuration.main_config["SLACK"]
|
||||
models = configuration.models
|
||||
|
||||
class MessageIsUnwanted(Exception):
|
||||
# This exception is triggered when the message is either threaded (reply to another message) or weird (like an edit, a deletion, etc)
|
||||
pass
|
||||
|
||||
class Message:
|
||||
ts = str
|
||||
user_id = str
|
||||
text = str
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
def __init__(self, message_dict):
|
||||
if message_dict.get("subtype", "not bad") == "message_changed":
|
||||
raise MessageIsUnwanted()
|
||||
if message_dict["type"] == "message":
|
||||
if "thread_ts" in message_dict and (message_dict["thread_ts"] != message_dict["ts"]): # meaning it's a reply to another message
|
||||
raise MessageIsUnwanted()
|
||||
|
||||
self.user_id = message_dict.get("user", "BAD USER")
|
||||
# self.channel_id = config["archive_id"] # by construction, other messages are not intercepted
|
||||
self.ts = message_dict["ts"]
|
||||
self.text = message_dict["text"]
|
||||
|
||||
else:
|
||||
self.logger.warning(f"What should I do of {message_dict}")
|
||||
raise MessageIsUnwanted()
|
||||
|
||||
|
||||
def __str__(self) -> str:
|
||||
return f"MSG [{self.text}]"
|
||||
|
||||
|
||||
@property
|
||||
def urls(self):
|
||||
pattern = r"<(.*?)>"
|
||||
matches = re.findall(pattern, self.text)
|
||||
matches = [m for m in matches if "." in m] # must contain a tld, right?
|
||||
|
||||
new_matches = []
|
||||
for m in matches:
|
||||
# further complication: slack automatically abreviates urls in the format:
|
||||
# <url|link preview>. Lucky for us, "|" is a character derecommended in urls, meaning we can "safely" split for it and retain the first half
|
||||
if "|" in m:
|
||||
keep = m.split("|")[0]
|
||||
else:
|
||||
keep = m
|
||||
new_matches.append(keep)
|
||||
return new_matches
|
||||
|
||||
@property
|
||||
def is_by_human(self):
|
||||
return self.user.user_id != config["bot_id"]
|
||||
|
||||
|
||||
@property
|
||||
def has_single_url(self):
|
||||
return len(self.urls) == 1
|
||||
|
||||
|
||||
|
||||
class BotApp(App):
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
def __init__(self, callback, *args, **kwargs):
|
||||
super().__init__(*args, **kwargs)
|
||||
self.callback = callback
|
||||
|
||||
|
||||
def pre_start(self):
|
||||
missed_messages = self.fetch_missed_channel_messages()
|
||||
|
||||
[self.handle_incoming_message(m) for m in missed_messages]
|
||||
self.startup_status()
|
||||
|
||||
|
||||
def say_substitute(self, *args, **kwargs):
|
||||
self.client.chat_postMessage(
|
||||
channel=config["archive_id"],
|
||||
text=" - ".join(args),
|
||||
**kwargs
|
||||
)
|
||||
|
||||
def fetch_missed_channel_messages(self):
|
||||
# latest processed message_ts is:
|
||||
presaved = models.ArticleDownload.select().order_by(models.ArticleDownload.slack_ts.desc()).get_or_none()
|
||||
if presaved is None:
|
||||
last_ts = 0
|
||||
else:
|
||||
last_ts = presaved.slack_ts_full
|
||||
|
||||
result = self.client.conversations_history(
|
||||
channel=config["archive_id"],
|
||||
oldest=last_ts
|
||||
)
|
||||
|
||||
new_messages = result.get("messages", [])
|
||||
# # filter the last one, it is a duplicate! (only if the db is not empty!)
|
||||
# if last_ts != 0 and len(new_messages) != 0:
|
||||
# new_messages.pop(-1)
|
||||
|
||||
return_messages = [Message(m) for m in new_messages]
|
||||
|
||||
refetch = result.get("has_more", False)
|
||||
while refetch: # we have not actually fetched them all
|
||||
try:
|
||||
result = self.client.conversations_history(
|
||||
channel = config["archive_id"],
|
||||
cursor = result["response_metadata"]["next_cursor"],
|
||||
oldest = last_ts
|
||||
) # fetches 100 messages, older than the [-1](=oldest) element of new_fetches
|
||||
refetch = result.get("has_more", False)
|
||||
|
||||
new_messages = result.get("messages", [])
|
||||
for m in new_messages:
|
||||
return_messages.append(Message(m))
|
||||
except SlackApiError: # Most likely a rate-limit
|
||||
self.logger.error("Error while fetching channel messages. (likely rate limit) Retrying in {} seconds...".format(config["api_wait_time"]))
|
||||
time.sleep(config["api_wait_time"])
|
||||
refetch = True
|
||||
|
||||
self.logger.info(f"Fetched {len(return_messages)} new channel messages.")
|
||||
return return_messages
|
||||
|
||||
|
||||
|
||||
def handle_incoming_message(self, message, say=None):
|
||||
"""Reacts to all messages inside channel archiving. This either gets called when catching up on missed messages (by pre_start()) or by the SocketModeHandler in 'live' mode"""
|
||||
if isinstance(message, dict):
|
||||
try:
|
||||
message = Message(message)
|
||||
except MessageIsUnwanted:
|
||||
return False
|
||||
|
||||
|
||||
self.logger.info(f"Handling message {message} ({len(message.urls)} urls)")
|
||||
|
||||
|
||||
if len(message.urls) > 1:
|
||||
self.say_substitute("Only the first url is being handled. Please send any subsequent url as a separate message", thread_ts=message.thread.slack_ts)
|
||||
|
||||
self.callback(message = message)
|
||||
|
||||
|
||||
def respond_channel_message(self, article, say=None):
|
||||
if say is None:
|
||||
say = self.say_substitute
|
||||
answers = article.slack_info
|
||||
if article.slack_ts == 0:
|
||||
self.logger.error(f"{article} has no slack_ts")
|
||||
else:
|
||||
self.logger.info("Skipping slack reply because it is broken")
|
||||
for a in []:
|
||||
# for a in answers:
|
||||
if a["file_path"]:
|
||||
try:
|
||||
self.client.files_upload(
|
||||
channels = config["archive_id"],
|
||||
initial_comment = f"{a['reply_text']}",
|
||||
file = a["file_path"],
|
||||
thread_ts = article.slack_ts_full
|
||||
)
|
||||
# status = True
|
||||
except SlackApiError as e: # upload resulted in an error
|
||||
say(
|
||||
"File {} could not be uploaded.".format(a),
|
||||
thread_ts = article.slack_ts_full
|
||||
)
|
||||
# status = False
|
||||
self.logger.error(f"File upload failed: {e}")
|
||||
else: # anticipated that there is no file!
|
||||
say(
|
||||
f"{a['reply_text']}",
|
||||
thread_ts = article.slack_ts_full
|
||||
)
|
||||
# status = True
|
||||
|
||||
|
||||
def startup_status(self):
    """Prints an overview of the articles. This needs to be called here because it should run after having fetched the newly sent messages"""
    total = models.ArticleDownload.select().count()
    to_be_processed = models.ArticleDownload.select().where(models.ArticleDownload.title == "").count()
    unchecked = models.ArticleDownload.select().where(models.ArticleDownload.verified == 0).count()
    bad = models.ArticleDownload.select().where(models.ArticleDownload.verified == -1).count()
    not_uploaded = models.ArticleDownload.select().where(models.ArticleDownload.archive_url == "").count()
    self.logger.info(
        f"[bold]NEWS-FETCH DATABASE STATUS[/bold]: Total entries: {total}; Not yet downloaded: {to_be_processed}; Not yet checked: {unchecked}; Not yet uploaded to archive: {not_uploaded}; Marked as bad: {bad}",
        extra={"markup": True}
    )

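The `[bold]…[/bold]` tags and `extra={"markup": True}` above only render as formatting when logging is routed through a handler that understands rich markup (the project presumably sets this up in its logging configuration). A minimal, self-contained sketch under that assumption:

# Hedged sketch, not code from this repo: rich's RichHandler honors a per-record
# "markup" flag passed via `extra`, which is what startup_status relies on.
import logging
from rich.logging import RichHandler

logging.basicConfig(level=logging.INFO, handlers=[RichHandler()])
demo_logger = logging.getLogger("status_demo")
demo_logger.info("[bold]NEWS-FETCH DATABASE STATUS[/bold]: Total entries: 3", extra={"markup": True})
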
class BotRunner():
    logger = logging.getLogger(__name__)

    """Stupid encapsulation so that we can apply the slack decorators to the BotApp"""
    def __init__(self, callback, *args, **kwargs) -> None:
        self.bot_worker = BotApp(callback, token=config["auth_token"])

        @self.bot_worker.event(event="message", matchers=[is_message_in_archiving])
        def handle_incoming_message(message, say):
            return self.bot_worker.handle_incoming_message(message, say)

        # @self.bot_worker.event(event="reaction_added", matchers=[is_reaction_in_archiving])
        # def handle_incoming_reaction(event, say):
        #     return self.bot_worker.handle_incoming_reaction(event)

        @self.bot_worker.event(event="event")
        def handle_all_other_reactions(event, say):
            self.logger.info("Ignoring slack event that isn't a message")

        self.handler = SocketModeHandler(self.bot_worker, config["app_token"])


    def start(self):
        self.bot_worker.pre_start()
        self.handler.start()


    def stop(self):
        self.handler.close()
        self.logger.info("Closed Slack-Socketmodehandler")


    def send(self, article):
        """Proxy function to send a message to the slack channel. Called by ArticleWatcher once the Article is ready"""
        self.bot_worker.respond_channel_message(article)



def is_message_in_archiving(message) -> bool:
    return message["channel"] == config["archive_id"]

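For orientation, a minimal sketch of how `BotRunner` might be driven; the callback name here is hypothetical, the real caller lives elsewhere in the repo:

# Hypothetical wiring: BotRunner invokes callback(message=...) for each message to archive.
def on_new_message(message):
    print(f"would now process the urls in {message}")

runner = BotRunner(callback=on_new_message)
try:
    runner.start()   # catches up via pre_start(), then blocks while listening over Socket Mode
except KeyboardInterrupt:
    runner.stop()
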
10
news_fetch/utils_storage/helpers.py
Normal file
@ -0,0 +1,10 @@
def clear_path_name(path):
    keepcharacters = (' ','.','_', '-')
    converted = "".join([c if (c.isalnum() or c in keepcharacters) else "_" for c in path]).rstrip()
    return converted

def shorten_name(name, offset = 50):
    if len(name) > offset:
        return name[:offset] + "..."
    else:
        return name

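Worked examples (illustrative inputs; the outputs follow directly from the two helpers above):

clear_path_name("NZZ: Bericht (2022)?.pdf")   # -> "NZZ_ Bericht _2022__.pdf"
shorten_name("short title")                   # -> "short title" (unchanged, 50 chars or fewer)
shorten_name("x" * 60)                        # -> "x" * 50 + "..."
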
191
news_fetch/utils_storage/models.py
Normal file
@ -0,0 +1,191 @@
import logging
logger = logging.getLogger(__name__)

from peewee import *
import os
import markdown
import configuration
import datetime

from . import helpers
config = configuration.main_config["DOWNLOADS"]
slack_config = configuration.main_config["SLACK"]

# set the nature of the db at runtime
download_db = DatabaseProxy()


class DownloadBaseModel(Model):
    class Meta:
        database = download_db

## == Article related models == ##
class ArticleDownload(DownloadBaseModel):
    # in the beginning this is all we have
    article_url = TextField(default = '', unique=True)

    # fetch then fills in the metadata
    title = TextField(default='')
    @property
    def is_title_bad(self): # add incrementally
        return "PUR-Abo" in self.title \
            or "Redirecting" in self.title \
            or "Error while running fetch" in self.title

    summary = TextField(default = '')
    source_name = CharField(default = '')
    language = CharField(default = '')


    file_name = TextField(default = '')
    @property
    def save_path(self):
        return f"{config['local_storage_path']}/{self.download_date.year}/{self.download_date.strftime('%B')}/"
    @property
    def fname_nas(self, file_name=""):
        if self.download_date:
            if file_name:
                return f"NAS: {config['remote_storage_path']}/{self.download_date.year}/{self.download_date.strftime('%B')}/{file_name}"
            else: # return the self. name
                return f"NAS: {config['remote_storage_path']}/{self.download_date.year}/{self.download_date.strftime('%B')}/{self.file_name}"
        else:
            return None
    @property
    def fname_template(self):
        if "youtube.com" in self.source_name or "youtu.be" in self.source_name:
            fname = f"{self.source_name} -- {self.title}"
        else:
            fname = f"{self.source_name} -- {self.title}.pdf"
        return helpers.clear_path_name(fname)


    archive_url = TextField(default = '')
    pub_date = DateField(default = datetime.date.fromtimestamp(0))
    download_date = DateField(default = datetime.date.today)

    slack_ts = FloatField(default = 0) # should be a fixed-length string but float is easier to sort by
    @property
    def slack_ts_full(self):
        str_ts = str(self.slack_ts)
        cut_zeros = 6 - (len(str_ts) - str_ts.find(".") - 1) # usually there are 6 decimals
        return f"{str_ts}{cut_zeros * '0'}"

    sent = BooleanField(default = False)

    archived_by = CharField(default = os.getenv("UNAME"))
    # need to know who saved the message because the file needs to be on their computer in order to get verified
    # verification happens in a different app, but the model has the fields here as well
    comment = TextField(default = '')
    verified = IntegerField(default = 0) # 0 = not verified, 1 = verified, -1 = marked as bad

    # authors
    # keywords
    # ... are added through foreignkeys
    # we will also add an attribute named message, to reference which message should be replied to. This attribute does not need to be saved in the db

    ## Helpers specific to a single article
    def __str__(self) -> str:
        if self.title != '' and self.source_name != '':
            desc = f"{helpers.shorten_name(self.title)} -- {self.source_name}"
        else:
            desc = f"{self.article_url}"
        return f"ART [{desc}]"

    @property
    def slack_info(self):
        status = [":x: No better version available", ":gear: Verification pending", ":white_check_mark: Verified by human"][self.verified + 1]
        content = "\n>" + "\n>".join(self.summary.split("\n"))
        file_status, msg = self.file_status()
        if not file_status:
            return [msg]

        # everything alright: generate real content
        # first the base file
        if self.file_name[-4:] == ".pdf":
            answer = [{ # main reply with the base pdf
                "reply_text" : f"*{self.title}*\n{status}\n{content}",
                "file_path" : self.save_path + self.file_name
            }]
        else: # don't upload if the file is too big!
            location = f"Not uploaded to slack, but the file will be on the NAS:\n`{self.fname_nas}`"
            answer = [{ # main reply with the base pdf
                "reply_text" : f"*{self.title}*\n{status}\n{content}\n{location}",
                "file_path" : None
            }]

        # then the related files
        rel_text = ""
        for r in self.related:
            fname = r.related_file_name
            lentry = "\n• `{}` ".format(self.fname_nas(fname))
            if fname[-4:] == ".pdf": # this is a manageable file, directly upload
                f_ret = self.save_path + fname
                answer.append({"reply_text":"", "file_path" : f_ret})
            else: # not pdf <=> too large. Don't upload but mention its existence
                lentry += "(not uploaded to slack, but the file will be on the NAS)"

            rel_text += lentry

        if rel_text:
            rel_text = answer[0]["reply_text"] = answer[0]["reply_text"] + "\nRelated files:\n" + rel_text

        return answer

    @property
    def mail_info(self):
        base = [{"reply_text": f"[{self.article_url}]({self.article_url})\n", "file_path":None}] + self.slack_info
        return [{"reply_text": markdown.markdown(m["reply_text"]), "file_path": m["file_path"]} for m in base]


    def set_authors(self, authors):
        for a in authors:
            if len(a) < 100: # otherwise it's a mismatched string
                ArticleAuthor.create(
                    article = self,
                    author = a
                )

    def set_related(self, related):
        for r in related:
            if len(r) > 255:
                raise Exception("Related file name too long for POSTGRES")

            ArticleRelated.create(
                article = self,
                related_file_name = r
            )

    def file_status(self):
        if not self.file_name:
            logger.error(f"Article {self} has no filename!")
            return False, {"reply_text": "Download failed, no file was saved.", "file_path": None}

        file_path_abs = self.save_path + self.file_name
        if not os.path.exists(file_path_abs):
            logger.error(f"Article {self} has a filename, but the file does not exist at that location!")
            return False, {"reply_text": "Can't find file. Either the download failed or the file was moved.", "file_path": None}

        return True, {}


class ArticleAuthor(DownloadBaseModel):
    article = ForeignKeyField(ArticleDownload, backref='authors')
    author = CharField()


class ArticleRelated(DownloadBaseModel):
    # Related files, such as the full text of a paper, audio files, etc.
    article = ForeignKeyField(ArticleDownload, backref='related')
    related_file_name = TextField(default = '')


def set_db(download_db_object):
    download_db.initialize(download_db_object)
    with download_db: # create tables (does nothing if they exist already)
        download_db.create_tables([ArticleDownload, ArticleAuthor, ArticleRelated])

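Since `download_db` is only a `DatabaseProxy`, the concrete backend is picked by the caller at runtime. A hedged sketch binding it to SQLite for local testing (the deployment presumably uses PostgreSQL, given the length check in `set_related`); this assumes the project's configuration and environment are already in place:

# Hedged sketch: file name and import path are illustrative, any peewee database object works.
from peewee import SqliteDatabase
import models          # news_fetch/utils_storage/models.py

models.set_db(SqliteDatabase("downloads.db"))
article = models.ArticleDownload.create(article_url="https://example.com/article")
print(article)         # -> ART [https://example.com/article]
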
@ -5,7 +5,7 @@ from pathlib import Path
import logging
logger = logging.getLogger(__name__)
import configuration
config = configuration.parsed["DOWNLOADS"]
config = configuration.main_config["DOWNLOADS"]

shrink_sizes = []

@ -8,7 +8,7 @@ from selenium import webdriver
import configuration
import json

config = configuration.parsed["DOWNLOADS"]
config = configuration.main_config["DOWNLOADS"]
blacklisted = json.loads(config["blacklisted_href_domains"])

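`blacklisted_href_domains` is run through `json.loads`, so the config entry is presumably a JSON array stored as a string. A sketch with made-up domains:

import json
raw = '["facebook.com", "doubleclick.net"]'   # hypothetical config value
blacklisted = json.loads(raw)                 # -> ['facebook.com', 'doubleclick.net']
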
@ -25,10 +25,10 @@ class PDFDownloader:
        options.profile = config["browser_profile_path"]
        # should be options.set_preference("profile", config["browser_profile_path"]) as of selenium 4 but that doesn't work

        if os.getenv("HEADLESS", "false") == "true":
            options.add_argument('--headless')
        if os.getenv("DEBUG", "false") == "true":
            self.logger.warning("Opening browser GUI because of 'DEBUG=true'")
        else:
            self.logger.warning("Opening browser GUI because of 'HEADLESS=false'")
            options.add_argument('--headless')

        options.set_preference('print.save_as_pdf.links.enabled', True)
        # Just save if the filetype is pdf already
@ -92,7 +92,7 @@ class PDFDownloader:

        # in the mean time, get a page title if required
        if article_object.is_title_bad:
            article_object.title = self.driver.title.replace(".pdf", "")
            article_object.title = self.driver.title.replace(".pdf", "") # some titles end with .pdf
            # will be propagated to the saved file (dst) as well

        fname = article_object.fname_template
@ -112,7 +112,6 @@ class PDFDownloader:

        if success:
            article_object.file_name = fname
            article_object.set_references(self.get_references())
        else:
            article_object.file_name = ""

@ -150,18 +149,6 @@ class PDFDownloader:
        return False


    def get_references(self):
        try:
            hrefs = [e.get_attribute("href") for e in self.driver.find_elements_by_xpath("//a[@href]")]
        except:
            hrefs = []
        # len_old = len(hrefs)
        hrefs = [h for h in hrefs \
            if not sum([(domain in h) for domain in blacklisted]) # sum([True, False, False, False]) == 1 (esp. not 0)
            ] # filter a tiny bit at least
        # self.logger.info(f"Hrefs filtered (before: {len_old}, after: {len(hrefs)})")
        return hrefs

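The `sum(...)` expression in `get_references` counts how many blacklisted domains occur in a href; any href with a count of at least 1 is dropped. A small self-contained illustration (domains are made up):

blacklisted = ["ads.example", "tracker.example"]
hrefs = ["https://ads.example/banner", "https://news.site/story"]
kept = [h for h in hrefs if not sum(domain in h for domain in blacklisted)]
# kept == ["https://news.site/story"]
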
@ -53,10 +53,5 @@ def get_description(article_object):
        article_object.set_authors(news_article.authors)
    except AttributeError:
        pass # list would have been empty anyway

    try:
        article_object.set_keywords(news_article.keywords)
    except AttributeError:
        pass # list would have been empty anyway


    return article_object
@ -7,15 +7,13 @@ class TemplateWorker(Thread):
    """Parent class for any subsequent worker of the article-download pipeline. They should all run in parallel, thus the Thread subclassing"""
    logger = logging.getLogger(__name__)

    def __init__(self, *args, **kwargs) -> None:
    def __init__(self, **kwargs) -> None:
        target = self._queue_processor # will be executed on Worker.start()
        group = kwargs.get("group", None)
        name = kwargs.get("name", None)

        super().__init__(group=group, target=target, name=name)
        self.keep_running = True
        super().__init__(target=target, daemon=True)
        self._article_queue = []
        self.logger.info(f"Worker thread {self.__class__.__name__} initialized successfully")



    def process(self, article_watcher):
        self._article_queue.append(article_watcher)#.article_model.article_url)
@ -23,7 +21,7 @@ class TemplateWorker(Thread):

    def _queue_processor(self):
        """This method is launched by thread.run() and idles when self._article_queue is empty. When an external caller appends to the queue it jumps into action"""
        while True: # PLEASE tell me if I'm missing an obvious better way of doing this!
        while self.keep_running: # PLEASE tell me if I'm missing an obvious better way of doing this!
            if len(self._article_queue) == 0:
                time.sleep(5)
            else:
@ -39,3 +37,10 @@ class TemplateWorker(Thread):
        article = article_watcher.article
        article = action(article) # action updates the article object but does not save the change
        article.save()
        article_watcher.update(self.__class__.__name__)


    def stop(self):
        self.logger.info(f"Stopping worker {self.__class__.__name__} with {len(self._article_queue)} articles left in queue")
        self.keep_running = False
        self.join()
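For reference, the queue-polling pattern that `TemplateWorker` implements, reduced to a self-contained sketch; the class and variable names here are illustrative, not the repo's:

# Minimal sketch of a polling worker thread with a cooperative stop flag.
import time
import threading

class PollingWorker(threading.Thread):
    def __init__(self):
        super().__init__(target=self._run, daemon=True)
        self.keep_running = True
        self._queue = []

    def process(self, item):
        self._queue.append(item)

    def _run(self):
        while self.keep_running:
            if not self._queue:
                time.sleep(1)          # idle until something is queued
            else:
                item = self._queue.pop(0)
                print(f"handled {item}")

    def stop(self):
        self.keep_running = False      # lets _run() exit on its next iteration
        self.join()

worker = PollingWorker()
worker.start()
worker.process("article-1")
time.sleep(2)
worker.stop()
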
@ -25,7 +25,7 @@ class DownloadWorker(TemplateWorker):
        action = self.dl_runner

        super()._handle_article(article_watcher, action)
        article_watcher.download_completed = True
        # article_watcher.download_completed = True



@ -36,7 +36,7 @@ class FetchWorker(TemplateWorker):
    def _handle_article(self, article_watcher):
        action = get_description # function
        super()._handle_article(article_watcher, action)
        article_watcher.fetch_completed = True
        # article_watcher.fetch_completed = True



@ -52,7 +52,7 @@ class UploadWorker(TemplateWorker):
        return run_upload(*args, **kwargs)

        super()._handle_article(article_watcher, action)
        article_watcher.upload_completed = True
        # article_watcher.upload_completed = True



@ -63,4 +63,4 @@ class CompressWorker(TemplateWorker):
    def _handle_article(self, article_watcher):
        action = shrink_pdf
        super()._handle_article(article_watcher, action)
        article_watcher.compression_completed = True
        # article_watcher.compression_completed = True