



Automatic recognition of correct solutions in a ciphertext-only attack on simple ciphers is not a trivial issue and remains a taxing problem. This paper introduces a new compression-based method for the automatic cryptanalysis of simple substitution ciphers. In particular, it shows how a Prediction by Partial Matching (PPM) text compression scheme, a method that performs well on a range of natural language processing tasks, can also be used for the automatic decryption of simple substitution ciphers. Experimental results showed that approximately 92% of the cryptograms were decrypted without any errors and 100% with three or fewer errors. Extensive investigations are described to determine which type of PPM scheme is most appropriate for automatically breaking substitution ciphers.

As introduced by Shannon in "Communication Theory of Secrecy Systems", entropy and unicity distance are defined at a global level, under the assumption that the properties of symbols resemble those of independent random variables. However, when entropy and unicity are applied to languages, e.g., in encryption and decryption, the symbols (letters) of a language are not independent. Thus, we introduce a new measure, HL(s), called the "local entropy" of a string s. HL(s) includes a priori information about the language and text at the time of application. Since the unicity distance is calculated from the entropy, local entropy leads to a local unicity distance for a string. Our local entropy measure explains why some texts are susceptible to decryption using fewer symbols than predicted by Shannon's unicity while other texts require more. We demonstrate local entropy using a substitution cipher, along with the results for an algorithm based on the principle, and show that Shannon's unicity is an average measure rather than a lower bound. This motivates a discussion of the implications of local entropy and unicity distance.
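To make the dependence of unicity distance on entropy concrete, the following is a minimal sketch of Shannon's standard estimate U = H(K) / D, where H(K) is the key entropy and D is the per-character redundancy of the language. The function name `unicity_distance` and the entropy values passed to it are assumptions chosen for illustration; in particular, the text above does not give the definition of HL(s), so the "local" values below are placeholders and not figures from the paper.

```python
import math


def unicity_distance(key_space_size: int, alphabet_size: int,
                     entropy_per_char: float) -> float:
    """Shannon's estimate U = H(K) / D, in characters.

    H(K) = log2(key_space_size), assuming uniformly chosen keys.
    D    = log2(alphabet_size) - entropy_per_char (redundancy per character).
    """
    key_entropy = math.log2(key_space_size)
    redundancy = math.log2(alphabet_size) - entropy_per_char
    if redundancy <= 0:
        raise ValueError("no redundancy: unicity distance is unbounded")
    return key_entropy / redundancy


# Simple substitution over 26 letters has 26! possible keys.
keys = math.factorial(26)

# 1.5 bits/char is a commonly quoted rough figure for English (an assumption).
print(round(unicity_distance(keys, 26, 1.5), 1))  # about 27.6 characters

# Hypothetical per-string entropies: a more predictable string gives a
# shorter distance, a less predictable one a longer distance.
for h in (1.0, 2.5):
    print(h, round(unicity_distance(keys, 26, h), 1))
```

Under these assumptions, swapping the global entropy for a per-string value moves the estimate in either direction, which is consistent with the claim that Shannon's figure behaves as an average rather than a lower bound.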

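The compression-based idea can be sketched as follows: a candidate decryption is scored by the number of bits a language model needs to encode it, and the candidate with the shortest code length is taken as the most plausible plaintext. This sketch is not the PPM scheme from the paper. It substitutes a fixed order-2 character model with add-one smoothing as a much simpler stand-in, trains it on a tiny made-up sample, and searches only the 26 Caesar shifts (a special case of simple substitution) to keep the key search trivial. All names (`train`, `code_length_bits`, `shift`, `SAMPLE`) are hypothetical.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz "
ORDER = 2  # context length in characters

# Tiny illustrative training text (an assumption, far too small for real use).
SAMPLE = ("the cat sat on the mat and the dog sat on the log while "
          "the rain fell on the roof of the old house near the river")


def normalise(text: str) -> str:
    """Lower-case the text and drop symbols outside the model alphabet."""
    return "".join(ch for ch in text.lower() if ch in ALPHABET)


def train(text: str, order: int = ORDER) -> dict:
    """Count which symbol follows each context of `order` characters."""
    counts = defaultdict(Counter)
    padded = " " * order + normalise(text)
    for i in range(order, len(padded)):
        counts[padded[i - order:i]][padded[i]] += 1
    return counts


def code_length_bits(text: str, counts: dict, order: int = ORDER) -> float:
    """Ideal code length of `text` in bits under the smoothed model."""
    padded = " " * order + normalise(text)
    bits = 0.0
    for i in range(order, len(padded)):
        ctx = counts.get(padded[i - order:i], Counter())
        # Add-one smoothing so unseen symbols keep a non-zero probability.
        p = (ctx[padded[i]] + 1) / (sum(ctx.values()) + len(ALPHABET))
        bits -= math.log2(p)
    return bits


def shift(text: str, k: int) -> str:
    """Caesar-shift lower-case letters by k positions, keeping spaces."""
    return "".join(
        ch if ch == " " else chr((ord(ch) - 97 + k) % 26 + 97) for ch in text
    )


model = train(SAMPLE)
plaintext = "the dog sat on the mat"
ciphertext = shift(plaintext, 13)

# Try every shift and keep the candidate with the shortest code length.
candidates = [shift(ciphertext, k) for k in range(26)]
best = min(candidates, key=lambda c: code_length_bits(c, model))
print(best)
```

A full attack on simple substitution would need a search over the 26! keys and a much stronger model trained on a large corpus; the sketch only illustrates the scoring principle of preferring the candidate that compresses best.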