[Users] Poorly-wrapped text when viewed in external editor
Victoria Stuart (VictoriasJourney.com)
mail at VictoriasJourney.com
Fri Dec 1 05:04:35 CET 2017
Hello everyone! I returned to this issue yesterday as I now want to use my CM messages for NLP-related work. Looking at the raw (saved) messages, I noticed (linux: file -i <file>) that they were "message/rfc822; charset=us-ascii" even though I work in UTF-8 and CM displays them correctly (e.g. Greek letters in scientific text). Noting that the headers contain
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
A quick search turned up two Linux utilities that decode MIME-encoded text, and I tried both:
munpack (part of the mpack package)
uudeview
Either one lets you unpack and decode those messages.
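For context, quoted-printable writes each non-ASCII byte as an ASCII "=XX" hex escape, which is why the raw files are reported as us-ascii even though the decoded text is UTF-8. The following is only a minimal sketch of that decoding step in bash, not a replacement for munpack or uudeview. The function name qp_decode is my own invention, and the sketch ignores headers and multipart structure entirely.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original script:
# decode quoted-printable text read from stdin.
qp_decode() {
    local line eol
    while IFS= read -r line || [ -n "$line" ]; do
        line=${line%$'\r'}            # tolerate CRLF line endings
        if [[ $line == *= ]]; then    # a trailing "=" is a soft line break
            line=${line%=}
            eol=''
        else
            eol='\n'
        fi
        line=${line//\\/\\\\}         # protect literal backslashes from printf %b
        # Turn each =XX hex escape into \xXX so printf %b emits the raw byte.
        line=$(sed 's/=\([0-9A-Fa-f][0-9A-Fa-f]\)/\\x\1/g' <<<"$line")
        printf '%b' "$line$eol"
    done
}

# Demo input: two Greek letters and one soft-wrapped word.
printf 'Greek: =CE=B1 =CE=B2 and a long li=\nne\n' | qp_decode
```

On a UTF-8 terminal the demo line prints "Greek: α β and a long line".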
I found the latter more useful, as it let me easily extract the text of each CM file as a .txt file, which I then used in the processing. I wrote a bash script, below, that recursively grabs CM files from a specified starting folder and writes out the textual portions with an .out extension (optionally into a single output folder).
Since CM names message files numerically (so the same number can recur in different directories), you can either output to the source directory or rename the files. I wanted to tag the file names with the parent directory name anyway, so I just dump everything into a common output folder, with each file named from the parent folder, the base name, and an .out extension.
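As a small illustration of that naming scheme, here is a sketch. The function name out_name and the two paths are hypothetical and serve only to show how two messages with the same number end up with distinct output names.

```shell
#!/bin/bash
# Hypothetical helper: build "<parent folder>.<message number>.out" from a path.
out_name() {
    local f=$1
    printf '%s.%s.out\n' "$(basename "$(dirname "$f")")" "$(basename "$f")"
}

# Made-up example paths; the same message number in two folders.
out_name "/path/to/claws_mail/inbox/12"   # prints: inbox.12.out
out_name "/path/to/claws_mail/sent/12"    # prints: sent.12.out
```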
Cheers, Victoria :-)
#!/bin/bash
# ============================================================================
export LANG=C.UTF-8
# /mnt/Vancouver/Programming/scripts/claws/claws-decode.sh
# ----------------------------------------------------------------------------
# ENABLE SPACES IN FILENAMES:
## https://stackoverflow.com/questions/4638874/how-to-loop-through-a-directory-recursively-to-delete-files-with-certain-extensi
## To allow spaces in filenames,
## at the top of the script include: IFS=$'\n'; set -f
## at the end of the script include: unset IFS; set +f
IFS=$'\n'; set -f
# ----------------------------------------------------------------------------
# SET PATHS:
PWD=$(pwd)
OUT=$PWD/out
# https://stackoverflow.com/questions/793858/how-to-mkdir-only-if-a-dir-does-not-already-exist
mkdir -p $OUT
IN="/mnt/Vancouver/Programming/scripts/claws/claws_mail/"
# https://superuser.com/questions/716001/how-can-i-get-files-with-numeric-names-using-ls-command
# FILES=$(find $IN -type f -regex ".*/[0-9 ]*") ## recursive; numeric filenames only
FILES=$(find $IN -type f -regex ".*/[0-9 ]*") ## recursive; numeric filenames only (may include spaces)
# echo '$FILES:' ## single-quoted, prints: $FILES:
# echo "$FILES" ## double-quoted, prints path/, filename (one per line)
# ----------------------------------------------------------------------------
# MAIN LOOP:
for f in $FILES
do
bp=$(basename $(dirname "$f"))
bn=$(basename "$f")
# Output to $OUT dir:
# https://askubuntu.com/questions/538913/how-can-i-copy-files-with-duplicate-filenames-into-one-directory-and-retain-both
# uudeview -q -i -t $f; /usr/bin/cp -f --backup=existing -S .orig 0001.txt $OUT/$bp.$bn.out; rm -f 0001.txt
uudeview -q -i -t $f; /usr/bin/cp -f --backup=simple -S .orig 0001.txt $OUT/$bp.$bn.out; rm -f 0001.txt
# Output to source dir:
# uudeview -q -i -t $f; /usr/bin/cp -f 0001.txt $f.out; rm -f 0001.txt
done
# ----------------------------------------------------------------------------
unset IFS; set +f
# ----------------------------------------------------------------------------
# CLEAN UP:
## (This must appear after the "unset IFS; set +f" line, above.)
# ----------------------------------------
# Remove multiple extensions via 'brace expansion':
# https://stackoverflow.com/questions/10516384/how-to-delete-multiple-files-at-once-in-bash-on-linux
# rm -f *.{jpg, pdf, png, gif}
# Either of these works, but do NOT include spaces anywhere inside the braces:
#rm -fR $PWD/*.{jpg,pdf,png,gif}
rm -fR $PWD/*{.jpg,.pdf,.png,.gif}
# ----------------------------------------
# Rename extensions of renamed, duplicate files:
rename out.orig orig.out $OUT/*.orig
# ============================================================================
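The script handles spaces in file names by setting IFS to a newline and disabling globbing. An alternative, sketched here only under the assumption of GNU find and bash, is to pass null-delimited paths from find into a while/read loop, which leaves IFS and globbing untouched. The demo tree is created in a temporary directory and is purely hypothetical; in real use IN would point at the mail folder.

```shell
#!/bin/bash
# Sketch only: gather numerically named files without changing IFS or globbing.
# The demo tree below is made up; replace IN with a real mail folder.
IN=$(mktemp -d)
mkdir -p "$IN/inbox" "$IN/My Folder"
touch "$IN/inbox/1" "$IN/inbox/23" "$IN/My Folder/7" "$IN/inbox/notes.txt"

files=()
# -print0 and read -d '' keep each path intact, whatever characters it contains.
while IFS= read -r -d '' f; do
    files+=("$f")
done < <(find "$IN" -type f -regex '.*/[0-9 ]*' -print0)

echo "matched ${#files[@]} numeric file(s)"
rm -rf "$IN"
```

Each element of the files array could then be passed, quoted, to uudeview in place of the unquoted $f in the main loop.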