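[Machine translation] Visualizing CCITT Group 3 and Group 4 TIFF Documents and Transforming to Run-Length Compressed Format Enabling Direct Processing in Compressed Domain (CMS 2016)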

International Conference on Computational Modeling and Security (CMS 2016)
Visualizing CCITT Group 3 and Group 4 TIFF Documents and Transforming to Run-Length Compressed Format Enabling Direct Processing in Compressed Domain
Mohammed Javed *a , Krishnanand S.H. a , P. Nagabhushan a , B. B. Chaudhuri b
a Department of Studies in Computer Science, University of Mysore, Mysore, India
b CVPR Unit, Indian Statistical Institute, Kolkata, India
Reference: T.4: Standardization of Group 3 facsimile apparatus for document transmission (itu.int)
Abstract
Data compression can be seen as a way to overcome the Big Data problem to a large extent, particularly its storage and transmission issues. For this reason, documents, images, audio and video are preferably archived and communicated in compressed form. However, any subsequent operation on the compressed data requires decompression, which implies additional computing resources. Developing techniques that operate on and analyze the contents of compressed data directly, without a decompression stage, is therefore a promising research issue. In this context, the Document Image Analysis (DIA) literature has recently reported some work on direct processing of run-length compressed document data, specifically targeting CCITT Group 3 1-D documents. Since run-length data is the backbone of the other, more advanced CCITT compression schemes, such as CCITT Group 3 2-D (T.4) and CCITT Group 4 2-D (T.6), which are widely supported by the TIFF and PDF formats, this paper proposes to intelligently generate the run-length data from T.4 and T.6 compressed data and thus extend the idea of direct document processing in the Run-Length Compressed Domain (RLCD). The run-length data generated by the proposed algorithm is experimentally validated, and 100% correlation is reported on a data set of compressed documents. Finally, text segmentation and word spotting applications in RLCD are also demonstrated.
©2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Peer-review under responsibility of the Organizing Committee of CMS 2016
Keywords: Run-length compressed domain processing, Run-length data, Modified Huffman (MH), Modified Read (MR), Modified Modified Read (MMR)
*Corresponding author: Mohammed Javed. Tel.: +919741161929; E-mail: javedsolutions@gmail.com
1. Introduction
In today's digital era, data compression is the technique generally employed to overcome the volume aspect of Big Data. As a result, large amounts of data are stored and transferred in compressed formats on a daily basis. In contrast, as is generally witnessed, any operation or analytics over compressed data is executed only after decompression. If this reversing stage of decompression could be avoided and the analytics carried out directly on the compressed version, it would be an additional breakthrough. Towards this, a deeper understanding of the nature of the compression would provide useful clues. Recently, this novel idea of operating directly over compressed data has attracted many researchers, and as a result recent books and research papers on compressed domain techniques [1,2,3,4] for texts, images and videos have been published. The Document Image Analysis (DIA) community has yet to gain momentum in this area.
In the DIA literature, there have been a few initial attempts to explore the possibility of operating directly over compressed formats such as CCITT Group 3 [1,5,6,7], CCITT Group 4 [3,8,9], JPEG [4] and JBIG [10]. However, the proposed methods and operations are each limited to a particular compressed format. In the recent literature, many interesting and deeper works, such as feature extraction [1,5,11], page segmentation [6,12], text segmentation [1,7] and font size detection [11], have been reported on the run-length compressed data of CCITT Group 3 1-D compressed documents in the Run-Length Compressed Domain (RLCD). Incidentally, the other advanced compression schemes of CCITT are also based on the Run-Length Encoding (RLE) technique. Based on variations in the RLE encoding process, CCITT (International Telegraph and Telephone Consultative Committee) has introduced a series of compression standards and transfer protocols for black and white images over telephone lines and data networks [13,14]. They are popularly known as CCITT Group 3 1-D (MH, Modified Huffman), CCITT Group 3 2-D (MR, Modified Read) and CCITT Group 4 2-D (MMR, Modified Modified Read). These compression algorithms are widely supported by the TIFF and PDF formats for handling printed and handwritten text documents. CCITT Group 3 contains synchronization codes and was developed for network communications, whereas CCITT Group 4 was designed for archival purposes and is applicable to large databases because of its high compression ratio. Overall, run-length compressed data is the backbone of the CCITT compression schemes. Therefore, this paper proposes to extend the idea of operating directly on compressed documents in RLCD to advanced compression schemes such as MR and MMR by intelligently generating run-length code. For this purpose, a novel algorithm is proposed in this paper.
Against this backdrop, the paper aims at (i) gaining a deeper understanding of the compressed data of the RLE-flavored compression schemes of CCITT, namely MH, MR and MMR, (ii) transforming the compressed data of MR and MMR to run-length data, and (iii) demonstrating direct operations and analytics on the generated run-length compressed data.
The rest of the paper is organized as follows. Section 2 discusses background information related to this research work, such as the TIFF data format and the MH, MR and MMR encoding schemes, from the perspective of compressed domain processing. Section 3 demonstrates visualization of TIFF compressed data. Section 4 introduces the novel algorithm for transforming MR and MMR compressed data to run-length data and subsequently discusses Run-Length Compressed Domain processing. Section 5 reports experimental results and Section 6 summarizes the research work.
2. Background
2.1. Structure
TIFF [15], which stands for Tagged Image File Format, is a graphical format; a typical TIFF file organization is shown in Fig-1. In the figure, IFH stands for Image File Header, Bitmap data contains the black and white pixel data in either raw or compressed form, IFD stands for Image File Directory, and EoB indicates End of Byte.

Fig. 1. File organization of TIFF
A TIFF file always begins with an 8-byte IFH that points to an IFD, as shown in Fig-2. In the figure, the first two bytes indicate the byte order, where 4949H in hexadecimal notation represents little-endian and 4D4DH represents big-endian order. The next two bytes, 002AH, identify the file type, which in this case is a TIFF file. The last four bytes, 00000012H, give the offset of the first IFD in bytes. An IFD inside a TIFF file gives information about a specific tag associated with the image, and this tag information is used during the decoding process. The general structure of an IFD is shown in Fig-3; it is made up of different fields totalling 12 bytes of data. There are nearly 14 IFDs [15] associated with a TIFF image. ImageWidth, ImageLength, Compression, XResolution and YResolution are a few examples of TIFF IFDs. A sample IFD for the tag ImageWidth is shown in Fig-4.

Fig. 2. Image file header
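The header layout described above can be checked with a few lines of Python. This is a minimal sketch rather than a TIFF reader: the function name `parse_tiff_header` is hypothetical, and the sample bytes are simply the values quoted in the text (4949H or 4D4DH, 002AH, 00000012H) written in the corresponding byte order.

```python
import struct
from typing import Tuple


def parse_tiff_header(header: bytes) -> Tuple[str, int]:
    """Parse the 8-byte Image File Header (IFH) of a TIFF file.

    Returns the byte order ("little" or "big") and the offset of the first IFD.
    """
    if len(header) < 8:
        raise ValueError("a TIFF header needs 8 bytes")
    order = header[:2]
    if order == b"II":        # 4949H: little-endian
        prefix = "<"
    elif order == b"MM":      # 4D4DH: big-endian
        prefix = ">"
    else:
        raise ValueError("unknown byte-order mark")
    # Two bytes of file type (002AH) followed by a four-byte IFD offset.
    magic, ifd_offset = struct.unpack(prefix + "HI", header[2:8])
    if magic != 0x002A:
        raise ValueError("not a TIFF file")
    return ("little" if prefix == "<" else "big"), ifd_offset


# The header from the text, stored little-endian: 49 49 2A 00 12 00 00 00.
print(parse_tiff_header(bytes.fromhex("49492A0012000000")))
```

The offset 00000012H is 18 in decimal, so for this header the first IFD would start 18 bytes into the file.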
Further, the IFD tag Compression in a TIFF file supports different compression algorithms, which are selected based on the type of data. The type of compression applied to the image data is identified by the special tag number 0103H. A general IFD structure for the Compression tag is given in Fig-3. In the figure, byte numbers 9 and 10 indicate the type of compression algorithm used. In Fig-3, the value 0001H indicates 'No Compression', whereas the values 0002H, 0003H and 0004H respectively indicate the MH, MR and MMR compression schemes. Therefore, based on the compression scheme indicated in the Compression tag, the compressed data is utilized for the proposed compressed domain processing.
Fig. 3. IFD of an uncompressed image
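Reading the Compression tag from one 12-byte IFD can be sketched as follows. The field split used here (2-byte tag, 2-byte type, 4-byte count, 4-byte value) follows the TIFF 6.0 layout; the function name `read_compression` and the sample entries are illustrative assumptions, and the value-to-scheme mapping is the one given in the text.

```python
import struct

# Mapping stated in the text for the Compression tag (0103H).
COMPRESSION_NAMES = {1: "No Compression", 2: "MH", 3: "MR", 4: "MMR"}


def read_compression(entry: bytes, little_endian: bool = True) -> str:
    """Return the compression scheme named by a 12-byte Compression IFD."""
    if len(entry) != 12:
        raise ValueError("an IFD is 12 bytes long")
    prefix = "<" if little_endian else ">"
    tag, field_type, count = struct.unpack(prefix + "HHI", entry[:8])
    if tag != 0x0103:
        raise ValueError("not a Compression entry")
    # Bytes 9 and 10 (1-based) hold the compression value.
    (value,) = struct.unpack(prefix + "H", entry[8:10])
    return COMPRESSION_NAMES.get(value, f"unknown ({value})")


# Hypothetical little-endian entry: tag 0103H, type 3 (SHORT), count 1, value 4.
print(read_compression(bytes.fromhex("030103000100000004000000")))
```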
Text content is very common in documents such as research articles, newspapers and magazines. Moreover, text carries important information of the document and can be losslessly represented and reproduced using a black and white image. To compress the contents of black and white images, popular image formats like TIFF and PDF widely support three compression schemes, namely MH, MR and MMR, which represent different flavors of RLE. The upcoming subsections therefore discuss the working model of MH, MR and MMR from the perspective of compressed data processing. The study presented here is meant to highlight the presence of the RLE backbone in these compressed formats and hence to find a way to transform these codes into RLCD. A detailed discussion of the compression schemes is available in [13,14,15].
2.2. Modified Huffman (MH)
Modified Huffman (MH), or CCITT Group 3 1-D encoding, is a variation of the Huffman compression algorithm [13]. A binary image is made up of a series of black and white pixel runs of variable lengths. The MH encoder scans the black (0) and white (1) pixel runs line by line and outputs a variable-length binary code word representing the run length and run color from a standard predefined Huffman table. This standard table, representing runs of black and white pixels, is part of the T.4 [13] specification. The table is used for encoding and decoding all CCITT Group 3 data. The output code word is normally shorter than the input pixel data, and hence compression is achieved.
In MH encoding, each run length is encoded using two kinds of codes, namely Makeup and Terminating codes. Every encoded pixel run is a combination of zero or more Makeup code words followed by a Terminating code word. The shorter runs are represented by Terminating code words and the longer runs by Makeup code words. Separate predefined tables of Terminating and Makeup code words exist for black and white runs. Pixel runs with a length from 0 to 63 are represented using a single Terminating code word. Runs with a length from 64 to 2623 pixels are encoded using a single Makeup code and a Terminating code. Runs longer than 2623 pixels are encoded using more than one Makeup code and a Terminating code. The overall run length is the sum of the run-length values encoded by each code word.
Consider the examples shown in Fig-4. A run of 22 black pixels will be encoded by the terminating code for a black run length of 22 (code word 01011001 obtained from the standard table [15]). This reduces a 22-bit run to the size of an 8-bit code word, a compression ratio of roughly 3 : 1. This is illustrated in Fig-4a. Further, a white run of 84 pixels will be represented using the makeup code for a white run length of 64 pixels followed by the terminating code for a white run length of 20 pixels (64 + 20 = 84). This encoding reduces 84 bits to 12 bits, a compression ratio of 7 : 1. This is illustrated in Fig-4b. A run of 10000 white pixels would be encoded as three makeup codes of 2560 white pixels (7680 pixels), a makeup code of 2304 white pixels, followed by the terminating code for 16 white pixels (2560 + 2560 + 2560 + 2304 + 16 = 10000). In this case 10000 run-length bits are encoded into five code words with a total length of 54 bits, for an approximate compression ratio of 185 : 1. This is illustrated in Fig-4c.
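The split of a run into makeup and terminating lengths in these examples can be reproduced with a short sketch. It assumes, as the text implies, that makeup codes exist for every multiple of 64 up to 2560; the function name `split_run` is hypothetical, and only the lengths are computed, not the code words themselves.

```python
from typing import List


def split_run(length: int) -> List[int]:
    """Split a run length into makeup lengths followed by one terminating length."""
    if length < 0:
        raise ValueError("run length must be non-negative")
    parts = []
    # Runs longer than 2623 pixels need repeated 2560-pixel makeup codes.
    while length > 2623:
        parts.append(2560)
        length -= 2560
    # One makeup code covers the largest multiple of 64 that still fits.
    if length >= 64:
        makeup = (length // 64) * 64
        parts.append(makeup)
        length -= makeup
    # The remainder (0 to 63) is always sent as a terminating code.
    parts.append(length)
    return parts


for run in (22, 84, 10000):
    print(run, split_run(run))
```

Summing each returned list gives back the original run length, which is the property a decoder relies on.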
Fig. 4. Modified Huffman encoding
In the MH encoding process, all scan lines are by convention designed to begin with a white run-length code word (because in most document images scan lines begin with white space). If a scan line begins with a black run, a white run-length code word of zero length is added at the beginning of the actual scan line code. EOL stands for End Of Line code, a 12-bit code word that begins every line in a Group 3 transmission. This code word is used to identify the start or end of a scan line during transmission. The decoder uses EOL codes to detect the width of a decoded scan line and to keep track of the number of scan lines in an image. If a short image is detected, the decoder pads the remaining length with scan lines of all white pixels.
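The white-first convention can be illustrated by converting a row of pixels to alternating run lengths and back. This is a small sketch using the pixel values stated earlier (white = 1, black = 0); the helper names `row_to_runs` and `runs_to_row` are hypothetical.

```python
from typing import List

WHITE, BLACK = 1, 0  # pixel values as used in the text


def row_to_runs(row: List[int]) -> List[int]:
    """Alternating run lengths of a scan line, always starting with a white run."""
    runs, colour, count = [], WHITE, 0
    for pixel in row:
        if pixel == colour:
            count += 1
        else:
            runs.append(count)       # emits 0 when the row starts with black
            colour, count = pixel, 1
    runs.append(count)
    return runs


def runs_to_row(runs: List[int]) -> List[int]:
    """Expand alternating white/black run lengths back into pixels."""
    row, colour = [], WHITE
    for length in runs:
        row.extend([colour] * length)
        colour = BLACK if colour == WHITE else WHITE
    return row


row = [1] * 3 + [0] * 4 + [1] * 4 + [0] * 3 + [1] * 2
print(row_to_runs(row))        # a 16-pixel line as white/black/white/... runs
print(row_to_runs([0, 0, 1]))  # a line that starts with black
```

Because every row starts with a white run, the colour of each run is implied by its position and never needs to be stored.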
Further, the RTC (Return To Control) code is used to terminate Group 3 message transmissions and is added to the end of every Group 3 data stream. The RTC signal consists simply of six consecutive EOLs, which indicates the end of message transmission. The RTC signal is not part of the encoded message data but part of the facsimile protocol. A FILL code word is a run of one or more zero bits that appears between the encoded scan line data and the EOL code. The FILL bits pad out an encoded scan line so that the transmission time of the line reaches a required length.
2.3. Modified Read (MR)
The MR, or CCITT Group 3 2-D, coding scheme is a line-by-line coding method. The important definitions associated with the MR coding scheme are reproduced below [13].
Changing element : In a scan line, an element whose color is different from that of the previous element
Reference element : An element whose position determines a coding mode
Coding mode : A method to code the position of each changing element along the coding line
Coding line : The current scan line
Reference line : The previous scan line
In the MR compression scheme [13], the position of each changing element on the coding line is encoded with respect to the position of a corresponding reference element. The reference element may be located either on the coding line or on the reference line. After a line has been coded, it becomes the reference line for the next coding line. In MR, to limit facsimile transmission errors, an MH coded line is sent at regular intervals, referred to as the K factor. For standard facsimile the value of K is 2, and at the higher resolution K is 4. For a digital image, K can be any positive non-zero integer. The MR compression scheme implements Group 3 encoding without using the EOL or RTC code words. Also, while writing Group 3 data to an image file, the initial 12-bit EOL, the 12 EOL bits per scan line, and the 72 RTC bits affixed to the end of each image are not used. Overall, for every K lines, the CCITT Group 3 2-D scheme encodes the first line in 1-D MH coding and the other K - 1 lines in 2-D coding.
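The role of the K factor can be made concrete with a tiny schedule function. This is only an illustration of the last sentence above, assuming lines are counted from zero and the first line of each group of K is the 1-D one; the name `line_coding_schedule` is hypothetical.

```python
from typing import List


def line_coding_schedule(num_lines: int, k: int) -> List[str]:
    """Label each scan line as 1-D (MH) or 2-D coded for a given K factor."""
    if k < 1:
        raise ValueError("K must be a positive integer")
    return ["1-D" if i % k == 0 else "2-D" for i in range(num_lines)]


print(line_coding_schedule(5, 5))  # one MH line, then four 2-D lines
print(line_coding_schedule(4, 2))  # standard facsimile, K = 2
```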
In CCITT Group 3 2-D [13] coding, five changing elements are defined, as given below:
a0: the reference element on the coding line.
a1: the next changing element to the right of a0 on the coding line.
a2: the next changing element to the right of a1 on the coding line.
b1: the next changing element on the reference line to the right of a0 and of opposite color to a0.
b2: the next changing element to the right of b1 on the reference line.
At the beginning of the coding process, a0 is set on an imaginary white changing element located just before the first element of the coding line. During the encoding process, the position of a0 is determined by the previous coding mode. The changing elements for a sample reference line and coding line are shown in Fig-5.
Fig. 5. Changing elements in MR/MMR encoding
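The definitions of a1, a2, b1 and b2 can be turned into a short lookup over two pixel rows. This is a sketch under stated assumptions: pixels are 1 for white and 0 for black, a0 = -1 stands for the imaginary white element before the line, and a missing changing element is placed just after the last pixel. The helper names are hypothetical, and the two sample rows are illustrative.

```python
from typing import Dict, List

WHITE, BLACK = 1, 0


def colour_at(line: List[int], i: int) -> int:
    """Pixel colour at index i; positions before the line count as white."""
    return WHITE if i < 0 else line[i]


def changing_elements(line: List[int]) -> List[int]:
    """Indices whose colour differs from that of the previous element."""
    return [i for i in range(len(line)) if line[i] != colour_at(line, i - 1)]


def locate(coding: List[int], reference: List[int], a0: int) -> Dict[str, int]:
    """Find a1, a2 on the coding line and b1, b2 on the reference line."""
    width = len(coding)
    a0_colour = colour_at(coding, a0)
    on_coding = [i for i in changing_elements(coding) if i > a0]
    a1 = on_coding[0] if on_coding else width
    a2 = on_coding[1] if len(on_coding) > 1 else width
    on_reference = [i for i in changing_elements(reference) if i > a0]
    # b1 must have the opposite colour to a0.
    b1_options = [i for i in on_reference if reference[i] != a0_colour]
    b1 = b1_options[0] if b1_options else width
    b2_options = [i for i in on_reference if i > b1]
    b2 = b2_options[0] if b2_options else width
    return {"a1": a1, "a2": a2, "b1": b1, "b2": b2}


reference = [1] * 2 + [0] * 6 + [1] * 3 + [0] * 3 + [1] * 2
coding = [1] * 10 + [0] * 5 + [1] * 1
print(locate(coding, reference, a0=-1))
```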
In the CCITT T.4 standard there are three coding modes: Pass Mode (P), Vertical Mode (V), and Horizontal Mode (H). The appropriate coding mode is selected based on the location of a changing element along the coding line.
Pass Mode: when the position of b2 lies to the left of a1.
Vertical Mode: when the relative distance between a1 and b1 is less than or equal to 3.
Horizontal Mode: when neither Pass Mode nor Vertical Mode occurs.
In Vertical Mode, depending on the relative distance between a1 and b1, there are seven possible cases. V(0) implies that a1 is just under b1, and VR(1) indicates that a1 is one pixel to the right of b1. The other cases are shown in Table-1. In the table, M(ai-aj) represents the code words of the 1-D compression standard for the run ai-aj.
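The three rules can be written as a small decision function. This sketch follows the T.4 ordering (Pass Mode is tested first, then Vertical, then Horizontal) and returns only a symbolic label; the function name `select_mode` is hypothetical and no code words are produced.

```python
def select_mode(a1: int, b1: int, b2: int) -> str:
    """Pick the 2-D coding mode from the positions of a1, b1 and b2."""
    if b2 < a1:
        return "P"                    # Pass Mode: b2 lies to the left of a1
    distance = a1 - b1
    if abs(distance) <= 3:            # Vertical Mode: seven possible cases
        if distance == 0:
            return "V(0)"
        return f"VR({distance})" if distance > 0 else f"VL({-distance})"
    return "H"                        # Horizontal Mode otherwise


print(select_mode(a1=10, b1=2, b2=8))    # b2 is left of a1
print(select_mode(a1=10, b1=11, b2=14))  # a1 is one pixel left of b1
print(select_mode(a1=12, b1=3, b2=20))   # too far apart for Vertical Mode
```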
2.4. Modified Modified Read (MMR)
The CCITT Group 4 2-D [14] coding scheme is known as the Modified Modified READ (Relative Element Address Designate) code (MMR). The coding process in MMR is similar to MR: the position of each changing element along the coding line is encoded with reference to the position of a corresponding reference element, which may be on the coding line or on the reference line. The reference line lies immediately above the coding line. Once the current coding line has been coded, it becomes the reference line for the next one. The reference line for the first scan line of a page is an imaginary white line. Overall, the coding scheme is very similar to MR, except that MMR codes the first line differently and, unlike MR, does not code every Kth line (K = 2 or 4) of image data in MH mode.

Table 1: Standard reference table for Vertical coding mode [13]
3. Visualizing TIFF Compressed Data
Data visualization [16] is the presentation of data in a pictorial or graphical format, so that the nature of the data is easily understood, analyzed and interpreted. In this section, the visualization of TIFF compressed data is demonstrated. A sample binary image pattern consisting of 5 rows and 16 columns is shown in Fig-6. It has five scan lines made up of black and white pixels.
Fig. 6. Sample black and white image
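The MH coding technique reads the rows of black and white pixels in the image line by line and encodes them using the standard Huffman table [13]. The MH compressed data generated line by line for the sample image in Fig-6 is given below,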
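As a hedged illustration of line-by-line MH coding, the sketch below encodes the five rows of run lengths listed in Table-2 (3 4 4 3 2 three times, then 2 6 3 3 2 and 10 5 1). The two small dictionaries are a partial excerpt of the T.4 terminating code tables, limited to the run lengths that occur in this sample, and should be checked against T.4 [13] before reuse. Padding each row with zero bits to a byte boundary is an assumption of this sketch, and the function name `mh_encode_row` is hypothetical.

```python
from typing import List

# Partial excerpt of the T.4 terminating codes (verify against the standard).
WHITE_CODES = {1: "000111", 2: "0111", 3: "1000", 4: "1011", 10: "00111"}
BLACK_CODES = {3: "10", 4: "011", 5: "0011", 6: "0010"}


def mh_encode_row(runs: List[int]) -> bytes:
    """MH-encode one row of alternating white/black runs, padded to whole bytes."""
    bits = "".join(
        (WHITE_CODES if i % 2 == 0 else BLACK_CODES)[run]
        for i, run in enumerate(runs)
    )
    bits += "0" * (-len(bits) % 8)  # assumed zero padding to a byte boundary
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


rows = [[3, 4, 4, 3, 2]] * 3 + [[2, 6, 3, 3, 2], [10, 5, 1]]
for runs in rows:
    print(runs, mh_encode_row(runs).hex(" ").upper())
```

Under these assumptions the first three rows each come out as 87 73 80, the fourth as 72 89 C0 and the fifth as 39 8E.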
To illustrate the MR/MMR coding technique, all the changing elements with respect to the reference line (first scan line) in Fig-6 are marked and shown in Fig-7. From the figure, it can be observed that every changing element in the coding lines (in scan lines 2 and 3, and partially in 4) is underneath, or at least very close to, that of the reference line. These positions will therefore be encoded with Vertical Mode. In Vertical Mode the relative positions are within three pixels of each other, and this type of encoding is the one most commonly occurring in a document image. Specifically, the case where the positions of the changing pixels are identical, known as V(0), is encoded using a single bit (the other options being VL(3) to VR(3) in Table-1). It can be observed that a vertical line of any thickness can be coded very efficiently. This interesting pattern is observed in bar-codes and has been used for their automatic detection in [17]. On the other hand, when the bottom of the image is reached, Pass Mode is encountered, because the position of the changing pixel is very different from that on the line above. This implies skipping two changing pixels, to black and back to white, on the line above. These interesting features were explored by Lu and Tan [3,8] for word searching and document retrieval, in [9] for simulating an OCR, in [18] for skew detection, and in [19,20] for document similarity and equivalence. The positions of changing pixels which are not in close proximity to those above are encoded in pairs using the Horizontal Mode of the Group 3 encoding mechanism. The MR compressed data generated for the sample image in Fig-7 using MR(K=5) code is given below,

The equivalent hexadecimal code for the above MR(K=5)/MMR compressed data is given as 87 73 80 87 73 80 87 73 80 72 89 C0 39 8E
3.1. Binary Viewer
When the sample image shown in Fig-6 is compressed in TIFF format, the generated compressed data can be visualized as shown in Fig-8 using a Binary Viewer software [21]. The file header, the compressed data and the tags in the compressed format are clearly marked in Fig-8. The compressed data is shown within a red Manhattan layout. The other layouts, in blue, yellow and pink, indicate the file header, the tags associated with the file, and the end of file.

Fig. 8. Binary viewer visualization of the TIFF compressed data of the sample image in Fig-6 using MR/MMR codes.

Fig. 9. (a) Run-length compressed data and (b) its equivalent image view.
3.2. Image View
The other way to visualize the compressed data is to generate the run-length data line by line from the MH/MR/MMR codes and then view it as an image. Both the run-length compressed data and its equivalent image view for the sample image in Fig-6 are shown in Fig-9.
4. Run-Length Compressed Domain
In this section, we propose a new compressed domain model, called the Run-Length Compressed Domain (RLCD), for processing the compressed document data of all three related coding modes (MH/MR/MMR) of CCITT compression. The RLCD model is shown in Fig-10. It generates run-length data intelligently with the help of the proposed run-length data extraction algorithm (see Fig-11) and defines compressed domain operations and analytics over the generated data. As in the earlier work [1,5,7,11], it can be observed in Fig-10 that the proposed model avoids decompression.


Fig. 11. Proposed run-length code extraction algorithm.
The run-length data extracted for each row of the sample image in Fig-6 is given in Table-2.
| W | B | W | B | W |
| 3 | 4 | 4 | 3 | 2 |
| 3 | 4 | 4 | 3 | 2 |
| 3 | 4 | 4 | 3 | 2 |
| 3 - 1 (VL) = 2 | 4 + 1 (VL) + 1 (VR) = 6 | 4 - 1 (VR) = 3 | 3 | 2 |
| 3 + Pass(4 + 4) - 1 (VL) = 10 | 3 + 1 (VL) + 1 (VR) = 5 | 2 - 1 (VR) = 1 | | |
Table 2: Illustration of run-length data extraction using the proposed algorithm (in the table, W is a white run and B is a black run)
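The last two rows of Table-2 can be reproduced with a small decoder that turns a sequence of 2-D coding modes into run lengths without ever building the pixel row. This is not the algorithm of Fig-11, which is not reproduced in this text; it is a sketch that follows the general T.4 decoding rules. The mode tuples `("P",)`, `("V", offset)` and `("H", run1, run2)` are a hypothetical symbolic form of already-parsed code words, with a negative offset standing for VL and a positive one for VR.

```python
from typing import List, Tuple


def runs_to_changes(runs: List[int]) -> List[int]:
    """Changing-element positions of a line given as white-first run lengths."""
    changes, pos = [], 0
    for length in runs[:-1]:
        pos += length
        changes.append(pos)
    return changes


def find_b1_b2(ref: List[int], a0: int, a0_white: bool, width: int) -> Tuple[int, int]:
    """b1: first reference change right of a0 with opposite colour; b2: the next one."""
    wanted = 0 if a0_white else 1  # even index = change to black, odd = to white
    for i, pos in enumerate(ref):
        if pos > a0 and i % 2 == wanted:
            return pos, (ref[i + 1] if i + 1 < len(ref) else width)
    return width, width


def decode_line(ref_runs: List[int], modes: List[tuple], width: int) -> List[int]:
    """Run lengths of the coding line from reference runs and symbolic modes."""
    ref = runs_to_changes(ref_runs)
    a0, white, changes = -1, True, []
    for mode in modes:
        b1, b2 = find_b1_b2(ref, a0, white, width)
        if mode[0] == "P":            # Pass: move a0 under b2, colour unchanged
            a0 = b2
        elif mode[0] == "V":          # Vertical: a1 = b1 + offset
            a0 = b1 + mode[1]
            changes.append(a0)
            white = not white
        else:                         # Horizontal: two explicit run lengths
            a1 = max(a0, 0) + mode[1]
            a0 = a1 + mode[2]
            changes += [a1, a0]
    runs, prev = [], 0
    for pos in changes:
        runs.append(pos - prev)
        prev = pos
    if prev < width:
        runs.append(width - prev)
    return runs


row4 = decode_line([3, 4, 4, 3, 2], [("V", -1), ("V", 1), ("V", 0), ("V", 0), ("V", 0)], 16)
row5 = decode_line(row4, [("P",), ("V", -1), ("V", 1), ("V", 0)], 16)
print(row4, row5)
```

Each decoded line immediately serves as the reference line for the next one, which is what allows the run-length data to be produced row by row.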
5. Experimental Results
In this section, we conduct experiments on MH, MR and MMR compressed documents to validate the run-length code generated by the proposed run-length code extraction algorithm. The ground truth data for the experiment is generated by directly decompressing the MH/MR/MMR compressed data to pixel data. The test data, on the other hand, is generated by decompressing the run-length data extracted by the proposed algorithm. The decompressed results from the test data and the ground truth are compared using the correlation measure given below:
(1)
where m and n are the corresponding rows and columns in the test image (A) after decompression and in the ground truth image (B).
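Equation (1) itself is not reproduced in this text, so the following is only a sketch of one common choice, the 2-D Pearson correlation coefficient between two equally sized images. Whether this matches the measure used in the experiments is an assumption, and the function name `correlation` is hypothetical.

```python
from math import sqrt
from typing import List


def correlation(a: List[List[int]], b: List[List[int]]) -> float:
    """2-D correlation coefficient between images A and B (assumed measure)."""
    flat_a = [p for row in a for p in row]
    flat_b = [p for row in b for p in row]
    if len(flat_a) != len(flat_b) or not flat_a:
        raise ValueError("images must be non-empty and of the same size")
    mean_a = sum(flat_a) / len(flat_a)
    mean_b = sum(flat_b) / len(flat_b)
    numerator = sum((x - mean_a) * (y - mean_b) for x, y in zip(flat_a, flat_b))
    denominator = sqrt(
        sum((x - mean_a) ** 2 for x in flat_a) * sum((y - mean_b) ** 2 for y in flat_b)
    )
    # A constant image has zero variance and would raise ZeroDivisionError here.
    return numerator / denominator


ground_truth = [[1, 1, 0, 0], [1, 0, 0, 1]]
test_image = [[1, 1, 0, 0], [1, 0, 0, 1]]
print(correlation(test_image, ground_truth))
```

With this measure, a test image identical to the ground truth scores 1, which would correspond to the 100% correlation mentioned in the abstract.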

Table 3. Experimental results for each compression algorithm
In the literature, a good deal of work on direct processing of run-length compressed document data has been proposed. Document image operations and analytics such as feature extraction [1,5,11], page segmentation [6,12], text segmentation [1,7] and font size detection [11] have been attempted using run-length data. Therefore, based on the experimental results in Table-3, all these operations and analytics can be extended to work with the advanced CCITT compression schemes discussed in this paper. One such extension is illustrated with the application of text segmentation and, subsequently, word spotting [7]. The experimental results of text segmentation and word spotting for a subset of documents from the dataset of [7] are tabulated in Table-4 and Table-5.

Table 4. The accuracy of word and character segmentation for 60 compressed text documents (20 documents each using MH, MR and MMR compression)

Table 5. The accuracy of word and character segmentation for 60 compressed text documents (20 documents each using MH, MR and MMR compression)
6. Conclusion
This research paper proposes a novel idea of carrying out document image analysis directly in the Run-Length Compressed Domain (RLCD), which is capable of handling the compressed data of CCITT Group 3 1-D and 2-D, and CCITT Group 4 2-D. This is accomplished by an algorithm that intelligently extracts the run-length data from T.4 and T.6 compressed TIFF documents and thereby extends the proposed model to the advanced compression schemes of CCITT. The run-length data generated by the proposed algorithm was experimentally validated using a correlation measure.
References
1. M. Javed, P. Nagabhushan, and B. B. Chaudhuri, "Extraction of projection profile, run-histogram and entropy features straight from run-length compressed documents," ACPR, pp. 813-817, November 2013.
2. J. Lu and D. Jiang, "Survey on the technology of image processing based on DCT compressed domain," ICMT, pp. 786-789, 2011.
3. Y. Lu and C. L. Tan, "Word searching in CCITT Group 4 compressed document images," ICDAR, pp. 467-471, 2003.
4. J. Mukhopadhyay, Image and Video Processing in Compressed Domain. Chapman and Hall/CRC, 2011.
5. M. Javed, P. Nagabhushan, and B. B. Chaudhuri, "Automatic extraction of correlation-entropy features for text document analysis directly in run-length compressed domain," in the IEEE Proceedings of ICDAR 2015.
6. M. Javed, P. Nagabhushan, and B. B. Chaudhuri, "Automatic page segmentation without decompressing the run-length compressed printed text documents," International Journal of Information Processing Systems (JIPS) (Accepted for Publication), 2015.
7. M. Javed, P. Nagabhushan, and B. B. Chaudhuri, "A direct approach for word and character segmentation in run-length compressed documents and its application to word spotting," in the IEEE Proceedings of ICDAR 2015.
8. Y. Lu and C. L. Tan, "Document retrieval from compressed images," Pattern Recognition, vol. 36, pp. 987-996, 2003.
9. U. V. Marti, D. Wymann, and H. Bunke, "OCR on compressed images using pass modes and hidden Markov models," Proceedings of IAPR Workshop on Document Analysis Systems, pp. 77-86, 2000.
10. E. Regentova, S. Latifi, D. Chen, K. Taghva, and D. Yao, "Document analysis by processing JBIG-encoded images," IJDAR, vol. 7, pp. 260-272, 2005.
11. M. Javed, P. Nagabhushan, and B. B. Chaudhuri, "Automatic detection of font size straight from run length compressed text documents," IJCSIT, vol. 5, pp. 818-825, February 2014.
12. M. Javed, P. Nagabhushan, and B. B. Chaudhuri, "Direct processing of run-length compressed document image for segmentation and characterization of a specified block," IJCA, vol. 83(15), pp. 1-6, December 2013.
13. CCITT Recommendation (T.4), "Standardization of group 3 facsimile apparatus for document transmission, terminal equipments and protocols for telematic services, vol. VII, fascicle VII.3, Geneva," tech. rep., 1985.
14. CCITT Recommendation (T.6), "Standardization of group 4 facsimile apparatus for document transmission, terminal equipments and protocols for telematic services, vol. VII, fascicle VII.3, Geneva," tech. rep., 1985.
15. TIFF, "(Tagged Image File Format) revision 6.0 specification," tech. rep., 1992.
16. "Data visualization (www.sas.com/en-us/insights/big-data/data-visualization.html)."
17. C. Maa, "Identifying the existence of bar codes in compressed images," CVGIP: Graphical Models and Image Processing, vol. 56, pp. 352-356, July 1994.
18. A. L. Spitz, "Analysis of compressed document images for dominant skew, multiple skew, and logotype detection," Computer Vision and Image Understanding, vol. 70, pp. 321-334, June 1998.
19. J. J. Hull and J. Cullen, "Document image similarity and equivalence detection," Proceedings of the Fourth International Conference on Document Analysis and Recognition (ICDAR'97), vol. 1, pp. 308-312, 1997.
20. J. J. Hull, "Document image similarity and equivalence detection," International Journal on Document Analysis and Recognition (IJDAR'98), vol. 1, pp. 37-42, 1998.
21. "Binary viewer software (http://www.proxoft.com/binaryviewer.aspx)."