[Users] Poorly-wrapped text when viewed in external editor

Victoria Stuart (VictoriasJourney.com) mail at VictoriasJourney.com
Fri Dec 1 05:04:35 CET 2017


Hello everyone!  I returned to this issue yesterday as I now want to use my CM messages for NLP-related work.  Looking at the raw (saved) messages, I noticed (on Linux: file -i <file>) that they are reported as "message/rfc822; charset=us-ascii", even though I work in UTF-8 and CM displays them correctly (e.g. Greek letters in scientific text).  Noting that the headers contain

    MIME-Version: 1.0
    Content-Type: text/plain; charset=UTF-8
    Content-Transfer-Encoding: quoted-printable

I did a quick search and explored two Linux utilities that decode MIME-encoded text:

    munpack (part of the mpack package)
    uudeview

Either one lets you process (unpack / decode) those messages.
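
To illustrate what the quoted-printable layer in those headers involves, here is a minimal sketch in bash. It is not how munpack or uudeview work internally; decode_qp is a hypothetical helper name, and it assumes bash and sed and handles body text only (no header or multipart parsing).

```bash
#!/bin/bash
# decode_qp: hypothetical helper (not part of munpack or uudeview).
# Reads quoted-printable body text on stdin, writes decoded text to stdout.
# Step 1: join soft line breaks (a trailing "=" continues onto the next line).
# Step 2: protect literal backslashes, then rewrite each =XX as \xXX.
# Step 3: let bash's printf %b expand the \xXX escapes into raw bytes.
decode_qp() {
    local escaped
    escaped=$(
        sed -e ':a' -e '/=$/{N;s/=\n//;ba' -e '}' |
        sed -e 's/\\/\\\\/g' -e 's/=\([0-9A-Fa-f][0-9A-Fa-f]\)/\\x\1/g'
    )
    printf '%b\n' "$escaped"
}
```

For example, piping the line "caf=C3=A9" through decode_qp yields the UTF-8 text "café".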

I found the latter to be the more useful, as it let me easily get the text of each CM file as a .txt file, which I then used in the processing.  I wrote a bash script, below, that recursively grabs CM files from a specified starting folder and returns the textual portions with a .out extension (optionally into a single output folder).

Since CM names message files numerically (so the names are not unique across directories), you can either output to the source directory or rename the files.  I wanted to tag the file names with the parent directory name anyway, so I just dump everything into a common output folder, with each file named with the parent folder, base name, and a .out extension.
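
The naming scheme can be sketched as a small function. out_name is a hypothetical helper used only for illustration; the script itself builds the same name inline from its bp and bn variables.

```bash
#!/bin/bash
# out_name: hypothetical helper showing the <parent folder>.<base name>.out scheme.
out_name() {
    local f=$1
    local bp bn
    bp=$(basename "$(dirname "$f")")   # parent directory name, e.g. "inbox"
    bn=$(basename "$f")                # numeric message file name, e.g. "42"
    printf '%s.%s.out\n' "$bp" "$bn"
}
```

With this, a message stored as inbox/42 and another stored as sent/42 map to inbox.42.out and sent.42.out, so they no longer collide in a shared output folder.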

Cheers, Victoria  :-)

# ============================================================================

#!/bin/bash

export LANG=C.UTF-8

# /mnt/Vancouver/Programming/scripts/claws/claws-decode.sh

# ----------------------------------------------------------------------------
# ENABLE SPACES IN FILENAMES:

## https://stackoverflow.com/questions/4638874/how-to-loop-through-a-directory-recursively-to-delete-files-with-certain-extensi

## To allow spaces in filenames,
##   at the top of the script include: IFS=$'\n'; set -f
##   at the end of the script include: unset IFS; set +f

IFS=$'\n'; set -f

# ----------------------------------------------------------------------------
# SET PATHS:

PWD=$(pwd)

OUT=$PWD/out
# https://stackoverflow.com/questions/793858/how-to-mkdir-only-if-a-dir-does-not-already-exist
mkdir -p "$OUT"

IN="/mnt/Vancouver/Programming/scripts/claws/claws_mail/"

# https://superuser.com/questions/716001/how-can-i-get-files-with-numeric-names-using-ls-command
FILES=$(find "$IN" -type f -regex ".*/[0-9 ]*")     ## recursive; numeric filenames only (may include spaces)

# echo '$FILES:'                                    ## single-quoted, prints: $FILES:
# echo "$FILES"                                     ## double-quoted, prints path/, filename (one per line)

# ----------------------------------------------------------------------------
# MAIN LOOP:

for f in $FILES
    do
        bp=$(basename "$(dirname "$f")")
        bn=$(basename "$f")

        # Output to $OUT dir:
        # https://askubuntu.com/questions/538913/how-can-i-copy-files-with-duplicate-filenames-into-one-directory-and-retain-both
        # uudeview -q -i -t $f; /usr/bin/cp -f --backup=existing -S .orig 0001.txt $OUT/$bp.$bn.out; rm -f 0001.txt
        uudeview -q -i -t "$f"; /usr/bin/cp -f --backup=simple -S .orig 0001.txt "$OUT/$bp.$bn.out"; rm -f 0001.txt

        # Output to source dir:
        # uudeview -q -i -t $f; /usr/bin/cp -f 0001.txt $f.out; rm -f 0001.txt
    done

# ----------------------------------------------------------------------------

unset IFS; set +f

# ----------------------------------------------------------------------------
# CLEAN UP:

## This must appear after the "unset IFS; set +f" line, above, because "set -f" disables
## the filename globbing (*) that the commands below rely on.

# ----------------------------------------
# Remove multiple extensions via 'brace expansion':

# https://stackoverflow.com/questions/10516384/how-to-delete-multiple-files-at-once-in-bash-on-linux
# WRONG (spaces inside the braces prevent brace expansion):
# rm -f *.{jpg, pdf, png, gif}
# Either of these works, but do NOT include spaces anywhere inside the braces:

#rm -fR "$PWD"/*.{jpg,pdf,png,gif}
rm -fR "$PWD"/*{.jpg,.pdf,.png,.gif}

# ----------------------------------------
# Rename extensions of renamed, duplicate files:

rename out.orig orig.out "$OUT"/*.orig

# ============================================================================
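
As an alternative to the IFS / "set -f" approach used above, file names containing spaces can also be handled by having find emit NUL-delimited paths. The following is only a sketch of that idiom, not what the script above does; for_each_message is a hypothetical helper name, and it assumes bash (for "read -d" and process substitution) and a find that supports -regex and -print0.

```bash
#!/bin/bash
# for_each_message: hypothetical helper.
# Runs the given command once per numerically named file under a directory,
# passing the file path as the last argument.
for_each_message() {
    local dir=$1; shift
    local f
    # -print0 / read -d '' keep each path intact, spaces included.
    while IFS= read -r -d '' f; do
        "$@" "$f"
    done < <(find "$dir" -type f -regex '.*/[0-9 ]*' -print0)
}
```

For example, "for_each_message /path/to/claws_mail echo" would print one matching path per line, and echo could be swapped for a function that runs uudeview on each file.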


