Text-laundering (Working With Text 3)

January 28, 2018

Written by: Kerim

Ever copy and paste something that should be a solid paragraph of text, which should look like this:

Consuetudium lectorum Mirum est notare. Eodem modo typi qui nunc nobis videntur parum clari fiant sollemnes in futurum? Assum Typi non habent claritatem insitam est usus legentis in iis. Claritatem Investigationes demonstraverunt lectores legere me lius quod ii legunt saepius Claritas est etiam. Nam liber tempor cum soluta. Est etiam processus dynamicus qui.

only to have it end up looking like this?

Consuetudium lectorum Mirum est notare.
Eodem modo typi qui nunc nobis videntur parum clari fiant sollemnes in futurum? Assum Typi non habent claritatem insitam est usus
legentis in iis.
Claritatem Investigationes demonstraverunt lectores legere me lius quod ii legunt saepius Claritas est etiam. Nam liber tempor cum soluta. Est etiam
processus dynamicus qui.

Most word processors have a command that lets you see invisible markers like spaces (usually represented as a faint dot “•”) and what are still quaintly called “carriage returns” or “line feeds” (generally shown by the symbols “¶” or “↵”).¹ If you turn that feature on, you will see that there are way too many such return symbols in the above text. It might seem like the solution would be to find and replace all those returns with spaces, but then you would have no paragraphs at all in your document. What you want to do is replace all the mid-paragraph returns, but leave those between paragraphs.

Using Regular Expressions (RegEx), as discussed in the last post in this series, what we would want to do is search for every return (or line feed) that is neither followed nor preceded by another return (or line feed). In addition, since some paragraphs are separated not by a blank line but by a tab or a sequence of spaces at the start of the new paragraph, we want to leave those breaks alone as well. I find the following search works pretty well for me:

(?<=[^\r\n\t ][^\r\n])\R(?=[^\r\n][^\r\n\t ])

Replace each match with a single space. It is easy to find many patterns like this in online forums, as I did, saving you the trouble of having to re-invent the wheel.
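For anyone who would rather script this than use a text editor's find-and-replace, here is a minimal sketch of the same search in Python. It is an adaptation, not the exact pattern above: Python's built-in `re` module does not support `\R`, so `(?:\r\n|\r|\n)` stands in for it. The function name `unwrap` and the sample text are only illustrative.

```python
import re

# The pattern from the post, with \R swapped for an explicit alternation
# because Python's built-in `re` module does not recognize \R.
MID_PARAGRAPH_BREAK = re.compile(
    r"(?<=[^\r\n\t ][^\r\n])(?:\r\n|\r|\n)(?=[^\r\n][^\r\n\t ])"
)


def unwrap(text: str) -> str:
    """Replace mid-paragraph line breaks with a single space.

    Breaks next to another break (blank lines between paragraphs)
    do not match the pattern and are left untouched.
    """
    return MID_PARAGRAPH_BREAK.sub(" ", text)


if __name__ == "__main__":
    sample = (
        "Consuetudium lectorum Mirum est notare.\n"
        "Eodem modo typi qui nunc nobis videntur.\n"
        "\n"
        "Nam liber tempor cum soluta.\n"
        "Est etiam processus dynamicus qui."
    )
    print(unwrap(sample))
```

Run on the sample, the two paragraphs should come out as one line each, still separated by the blank line. It is worth testing on your own text first, especially if your paragraphs are marked by indents rather than blank lines.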

If you prefer not to have to muddle around with code, there are a number of tools out there which can automate this kind of text cleanup for you. On macOS my favorite is the package of free WordService menu extensions from DEVONtechnologies. These are extensions that work with the built-in “Services” menu that pops up on macOS whenever you control-click on some selected text. The package offers a number of useful commands to do things like change the capitalization of the selected text (e.g. turn “THE APPLE” into “The Apple” or “The apple”), reformat line breaks (or remove them altogether), and give you useful statistics such as the word or character count of the selected text.

Considering that WordService is free and does pretty much the same thing, you might not want to spend $45 for TextSoap, but if you already have a subscription to the Setapp bundle of macOS apps then TextSoap is included with your subscription. Another option is Textwell, which works on both macOS and iOS and can do much more than just clean text. It has some built-in tools, much like those offered in WordService, but (if you aren’t afraid of tweaking the JavaScript in the example code) you can also make your own actions. I really like that these can be synced between the desktop and iOS. Clean Text for iOS is even easier to use, but less customizable. Since I don’t use Windows, Linux, or Android, I’ll leave it for others to recommend their favorite text cleanup tools for those platforms in the comments.

List of posts in this series

¹ Actually, there are significant differences between carriage returns and line feeds, but they aren’t important for this post.

Kerim

P. Kerim Friedman is a professor in the Department of Ethnic Relations and Cultures at National Dong Hwa University in Taiwan. His research explores language revitalization efforts among indigenous Taiwanese, looking at the relationship between language ideology, indigeneity, and political economy. An ethnographic filmmaker, he co-produced the Jean Rouch award-winning documentary, ‘Please Don’t Beat Me, Sir!’ about a street theater troupe from one of India’s Denotified and Nomadic Tribes (DNTs).

kerim.oxus.net/
