Text files

Text files are sequences of UTF-8 (Unicode Transformation Format-8) characters without markup. They are editable by simple system editors and are highly transportable. Although without markup, the files have a fixed format described below to facilitate machine processing. Text files lack Documentary Hypothesis markings.

Text files can be obtained in one of four ways:

  1. When viewing biblical text, press the "Text" button at the bottom of a text page. The file will be displayed by the browser and can be saved via the browser's "Save Page as ..." command. The text file will have the Layout and Content of the original text. (The font and font size are not in the file and are browser-dependent.)
  2. Text files for entire biblical books can be obtained by clicking on the book name on the Home page. The middle row of the resulting page offers a table of available formats. Click on the "Text" item to view the entire book in text format.
  3. Text files can also be obtained from Server.txt by entering a query URL into a browser address bar. Only two parameters are permitted, layout and content, and each takes a value from the corresponding pulldown list. For example,
    https://tanach.us/Server.txt?Deut26:5-9&layout=Qere-only&content=Consonants
    displays Deuteronomy 26:5-9 in Qere-only Layout with Consonants Content. Note that the parameter names layout and content are not capitalized, although the values Qere-only and Consonants are. See the Servers page for more information about using Server.txt.
  4. Zipped archives of complete Tanach books in text format, Tanach.txt.zip, are available in the TextFiles directory. See the "Zipped archives of Tanach books" section of the Technical page.
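
The query URL form in item 3 can also be assembled programmatically. The following is a minimal sketch only; the function name build_query_url and its argument names are hypothetical, and it assumes the reference, layout, and content values are passed through unescaped, exactly as in the example URL above.

```python
# Hypothetical helper: assembles a Server.txt query URL in the same
# shape as the documented example. Values are not URL-escaped, because
# the example URL carries the colon in "Deut26:5-9" unescaped.
BASE_URL = "https://tanach.us/Server.txt"


def build_query_url(reference, layout, content, base=BASE_URL):
    """Return a query URL with the two permitted parameters."""
    return f"{base}?{reference}&layout={layout}&content={content}"


if __name__ == "__main__":
    print(build_query_url("Deut26:5-9", "Qere-only", "Consonants"))
```

The resulting string can be pasted into a browser address bar or handed to any HTTP client.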

Text file applications:

Because the files are without markup, the display of the text files is completely dependent on the implementation of the Unicode bi-directional (bidi) text algorithm in the displaying application. Although Unicode has been a web standard for a long time, many applications do not have full implementations of the bidi algorithm, and not all of them reliably display Unicode text files that contain both left-to-right and right-to-left text. Fortunately, the files display correctly in the default editor of each system listed below and in many other applications:

System               Editor     Font              Comments
Windows 10           Notepad    Taamey D Web      Do not use WordPad!
Mac OSX 10.6.3+      TextEdit   Raanana
Ubuntu Linux 9.10+   gedit      Frank Ruehl CLM
SUSE Linux 11.1+     KWrite     Frank Ruehl CLM

Even if the display is distorted by an application's shortcomings in bidi implementation, machine processing of the files will not be affected.

Format:

Unicode contains three helpful directional characters that set the text direction for the next block of characters: the left-to-right embedding (LRE, 202a), the right-to-left embedding (RLE, 202b), and the pop directional formatting (PDF, 202c). The text files have been formatted with LRE, RLE, and PDF characters. Applications that fail to handle these characters correctly will produce erroneous displays.

Lines consist of either an LRE or RLE character followed by a sequence of Unicode characters terminated by a PDF character, a carriage return (000d), and a linefeed (000a). By this approach each line starts with a directional specification and then 'escapes' this direction at the end of the line. Thus there is no change in embedding level from line to line.
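
The line structure just described can be checked mechanically. The following is a minimal sketch, not part of the original tooling; the function name split_line and the "ltr"/"rtl" labels are hypothetical, and it assumes each raw line still carries its carriage return and linefeed.

```python
# Directional characters named in the format description.
LRE = "\u202a"  # left-to-right embedding
RLE = "\u202b"  # right-to-left embedding
PDF = "\u202c"  # pop directional formatting


def split_line(raw):
    """Return (direction, content) for one raw line, or None if the
    line does not follow the LRE/RLE ... PDF CR LF pattern."""
    if not raw.endswith(PDF + "\r\n"):
        return None
    if raw.startswith(LRE):
        direction = "ltr"
    elif raw.startswith(RLE):
        direction = "rtl"
    else:
        return None
    # Drop the leading embedding character and the trailing PDF, CR, LF.
    return direction, raw[1:-3]
```

When reading a file with Python's open(), passing encoding="utf-8" and newline="" keeps the carriage returns intact, so the lines match the pattern assumed here.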

Lines containing labels or blanks begin with the LRE character and then the prefix xxxx, followed by the text, and then are terminated by a PDF character, a carriage return (000d), and a linefeed (000a). Blank lines and a line with the chapter number are inserted between chapters for ease of reading.

All Hebrew text lines begin with an RLE and follow this order:

  1. The initial RLE,
  2. one (1) non-breaking space (00a0),
  3. if the Layout is "Full" or "Note-free", the following three fields ("Text-only" and "Qere-only" Layouts omit them):
       a. the verse number in a three-digit-wide field padded with non-breaking spaces (00a0),
       b. a sof pasuq '׃' (05c3, acting in place of a colon),
       c. the chapter number in a three-digit-wide field padded with non-breaking spaces (00a0),
  4. one (1) non-breaking space (00a0),
  5. the Hebrew text with transcription notes (see below), and
  6. the line termination consisting of a PDF character, a carriage return (000d), and a linefeed (000a).

In parsing the text, the first number found is the verse number, not the chapter number.
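
The field order above can be turned into a small parser. This is a sketch under stated assumptions, not the site's own code: the name parse_hebrew_line is hypothetical, the padding side of the number fields is not specified in the description (so the sketch strips non-breaking spaces from both sides), and lines without number fields are returned with None for chapter and verse.

```python
import re

RLE, PDF = "\u202b", "\u202c"
NBSP, SOF_PASUQ = "\u00a0", "\u05c3"

# Verse field, sof pasuq, chapter field, one separating space, text.
NUMBERED = re.compile(
    "([" + NBSP + "0-9]{3})" + SOF_PASUQ
    + "([" + NBSP + "0-9]{3})" + NBSP + "(.*)",
    re.DOTALL,
)


def parse_hebrew_line(raw):
    """Return (chapter, verse, text) for a Hebrew text line, or None
    if the line is not framed by RLE + nbsp ... PDF."""
    body = raw.rstrip("\r\n")
    if not (body.startswith(RLE + NBSP) and body.endswith(PDF)):
        return None
    body = body[2:-1]  # drop RLE, the first nbsp, and the final PDF
    match = NUMBERED.match(body)
    if match:
        # The first number on the line is the verse, the second the chapter.
        verse = match.group(1).strip(NBSP)
        chapter = match.group(2).strip(NBSP)
        if verse.isdigit() and chapter.isdigit():
            return int(chapter), int(verse), match.group(3)
    # Layouts without number fields: only the separating nbsp remains.
    text = body[1:] if body.startswith(NBSP) else body
    return None, None, text
```

The returned text still contains any transcription notes and their embedded LRE and PDF characters.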

Hebrew text lines may contain transcription notes depending on the selected Layout. These notes are denoted by values within square brackets, i.e. [x]. Because the square brackets and possibly the note itself may be left-to-right characters, all notes are preceded by an LRE character and followed by a PDF character. Thus there is no change in embedding level after entering and leaving a transcription note. Except in the Qere-only Layout, ketiv/qere sections are marked with '*'s as in the original Michigan-Claremont coding.
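
Because each note is wrapped as LRE, '[', value, ']', PDF, notes can be separated from the Hebrew text with one pattern. A minimal sketch follows; the name split_notes is hypothetical, and it assumes a note value never contains a closing square bracket.

```python
import re

LRE, PDF = "\u202a", "\u202c"

# One transcription note: LRE, '[', value, ']', PDF.
NOTE = re.compile(LRE + r"\[([^\]]*)\]" + PDF)


def split_notes(text):
    """Return (text_without_notes, list_of_note_values)."""
    notes = NOTE.findall(text)
    return NOTE.sub("", text), notes
```

The ketiv/qere '*' markers are left untouched by this sketch.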

The text files with "Full" Layout and "Accent" Content contain all the information available in the Unicode/XML Leningrad Codex. The ordering of cantillation marks may differ from that of the Unicode/XML Leningrad Codex, however. The mark ordering is that suggested by John Hudson in his SBL Hebrew Font User Manual.
