Journal of Robotics, Networking and Artificial Life

Volume 5, Issue 2, September 2018, Pages 93 - 96

A Method for Secure Communication Using a Discrete Wavelet Transform for Audio Data and Improvement of Speaker Authentication

Authors
Kouhei Nishimura1, Yasunari Yoshitomi2, yoshitomi@kpu.ac.jp, Taro Asada2, t_asada@mei.kpu.ac.jp, Masayoshi Tabuse2, tabuse@kpu.ac.jp
1Nippon Telegraph and Telephone West Corp., 3-15 Bamba-cho, Chuo-ku, Osaka 540-8511, Japan
2Graduate School of Life and Environmental Sciences, Kyoto Prefectural University, Nakaragi-cho, Shimogamo, Sakyo-ku, Kyoto 606-8522, Japan
Available Online 30 September 2018.
DOI
10.2991/jrnal.2018.5.2.4
Keywords
Secure communication; Audio data processing; Wavelet transform; Encoding
Abstract

We developed a secure communication method using a discrete wavelet transform. Two users must each have a copy of the same piece of music to be able to communicate with each other. The message receiver can produce audio data similar to the sending user's speech by using our previously proposed method and the given recording of music. To improve the accuracy of speaker authentication, the quantization level for the scaling coefficients is increased. Furthermore, the amount of data sent to the message receiver can be remarkably reduced by exploiting the characteristics of this data.

Copyright
Copyright © 2018, the Authors. Published by Atlantis Press.
Open Access
This is an open access article under the CC BY-NC license (http://creativecommons.org/licences/by-nc/4.0/).

1. Introduction

The elderly are often targets of telephone fraud. The fraudster pretends to be a grandchild of the elderly person while talking on the phone, and appeals to the elderly person to send money, for example, through a bank transfer. In the present study, we propose a method for secure communication using a discrete wavelet transform (DWT) and thus improve speaker authentication; this is an enhancement of our previously proposed method.1 It can be used with Internet protocol (IP) telephones, and it has the potential to help prevent telephone fraud.

2. Proposed Method

2.1. Encoding

2.1.1. Phenomenon exploited for the coding algorithm for audio data

In the course of our research,1 we found that the histogram of the scaling coefficients for each domain of a multiresolution analysis (MRA) sequence is centered at approximately zero when a DWT is performed on audio data. Exploiting this phenomenon, we have developed a secure communication method using audio data.1
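This phenomenon can be illustrated with a one-level Haar DWT in plain Python; the test tone, sample rate, and helper name below are illustrative assumptions, not the audio or wavelet basis used in the original experiments.

```python
import math

def haar_level(x):
    """One level of the Haar DWT: returns (scaling, wavelet) coefficients."""
    s = [(x[2 * i] + x[2 * i + 1]) / math.sqrt(2) for i in range(len(x) // 2)]
    w = [(x[2 * i] - x[2 * i + 1]) / math.sqrt(2) for i in range(len(x) // 2)]
    return s, w

# A zero-mean toy "audio" signal (440 Hz tone sampled at 8 kHz).
signal = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(1024)]
scaling, wavelet = haar_level(signal)

# For zero-mean audio the histogram of scaling coefficients centers near zero.
mean_scaling = sum(scaling) / len(scaling)
```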

2.1.2. Use of five quantization levels for scaling coefficients

(1) Parameter setting

In our reported study,1 we set the following coding parameters.

The values of Th(minus) and Th(plus) in Fig. 1 are chosen such that Th(minus) divides the nonpositive scaling coefficients (Sm in total) into two equal-frequency groups, and Th(plus) divides the positive scaling coefficients (Sp in total) into two equal-frequency groups. Next, the values of T1, T2, T3, and T4, which are the parameters for controlling the authentication precision, are chosen to satisfy the following conditions:

  1) T1 < Th(minus) < T2 < 0 < T3 < Th(plus) < T4.

  2) The value of ST1, which is the number of scaling coefficients in (T1, Th(minus)), is equal to ST2, which is the number of scaling coefficients in [Th(minus), T2), i.e., ST1 = ST2.

  3) The value of ST3, the number of scaling coefficients in (T3, Th(plus)], is equal to ST4, the number of scaling coefficients in (Th(plus), T4), i.e., ST3 = ST4.

  4) ST1/Sm = ST3/Sp.

Fig. 1.

Schematic diagram for demonstrating the selection of the scaling coefficients for encoding the audio data.1

In the present study, the values of both ST1/Sm and ST3/Sp are set to 0.3, which was determined experimentally.
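In effect, the equal-frequency conditions above reduce to taking quantiles of the nonpositive and positive coefficients separately. The sketch below is a minimal Python illustration; the rank-based quantile convention and the function names are assumptions, not the authors' implementation.

```python
def quantile(sorted_vals, q):
    """Rank-based empirical quantile (a simple convention; an assumption)."""
    idx = min(round(q * (len(sorted_vals) - 1)), len(sorted_vals) - 1)
    return sorted_vals[idx]

def choose_thresholds(coeffs, ratio=0.3):
    """Th(minus)/Th(plus) split each sign class in half; T1..T4 place a
    fraction `ratio` of each class on either side of those medians."""
    neg = sorted(v for v in coeffs if v <= 0)   # Sm nonpositive coefficients
    pos = sorted(v for v in coeffs if v > 0)    # Sp positive coefficients
    th_minus = quantile(neg, 0.5)               # halves the nonpositives
    th_plus = quantile(pos, 0.5)                # halves the positives
    t1 = quantile(neg, 0.5 - ratio)             # ST1/Sm = ratio in (T1, Th-)
    t2 = quantile(neg, 0.5 + ratio)             # ST2/Sm = ratio in [Th-, T2)
    t3 = quantile(pos, 0.5 - ratio)             # ST3/Sp = ratio in (T3, Th+]
    t4 = quantile(pos, 0.5 + ratio)             # ST4/Sp = ratio in (Th+, T4)
    return t1, th_minus, t2, t3, th_plus, t4
```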

(2) Encoding

In the preprocessing of the audio data prior to encoding, the scaling coefficients V of the MRA sequence are separated into five sets (G0 to G4), as shown in Fig. 1, under the following criteria:

  • G0 = {V | V ∈ VSC, V ≤ T1},

  • G1 = {V | V ∈ VSC, T1 < V < T2},

  • G2 = {V | V ∈ VSC, T2 ≤ V ≤ T3},

  • G3 = {V | V ∈ VSC, T3 < V < T4},

  • G4 = {V | V ∈ VSC, T4 ≤ V},

where VSC is the set of scaling coefficients in the audio data file.

The scaling coefficients for the MRA sequence are encoded according to the following rules, where Vi denotes scaling coefficient i: when Vi ∈ G0, ci = 0; when Vi ∈ G1, ci = 1; when Vi ∈ G2, ci = 2; when Vi ∈ G3, ci = 3; and when Vi ∈ G4, ci = 4. We represent the scaling coefficients in each set Gj by their average value, mj. For the formation of audio data, we use a code C, which is the sequence of the ci and mj defined above.
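The coding rule above can be sketched as follows; the function name and the toy thresholds in the usage below are illustrative assumptions.

```python
def encode_five_levels(coeffs, t1, t2, t3, t4):
    """Assign each scaling coefficient a code 0-4 (sets G0-G4) and
    compute the representative average m_j of each set."""
    def level(v):
        if v <= t1:
            return 0          # G0: V <= T1
        if v < t2:
            return 1          # G1: T1 < V < T2
        if v <= t3:
            return 2          # G2: T2 <= V <= T3
        if v < t4:
            return 3          # G3: T3 < V < T4
        return 4              # G4: T4 <= V
    codes = [level(v) for v in coeffs]
    groups = {j: [v for v, c in zip(coeffs, codes) if c == j] for j in range(5)}
    averages = {j: (sum(g) / len(g) if g else 0.0) for j, g in groups.items()}
    return codes, averages
```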

2.1.3. Use of eight quantization levels for scaling coefficients

Here, we define eight sets, G8,0 to G8,7, as follows:

  • G8,0 = {V | V ∈ VSC, V ≤ T1},

  • G8,1 = {V | V ∈ VSC, T1 < V < Th(minus)},

  • G8,2 = {V | V ∈ VSC, Th(minus) ≤ V ≤ T2},

  • G8,3 = {V | V ∈ VSC, T2 < V < 0},

  • G8,4 = {V | V ∈ VSC, 0 ≤ V ≤ T3},

  • G8,5 = {V | V ∈ VSC, T3 < V < Th(plus)},

  • G8,6 = {V | V ∈ VSC, Th(plus) ≤ V ≤ T4},

  • G8,7 = {V | V ∈ VSC, T4 < V}.

Again, we let the representative value for each set G8,i be its average, m8,i. For the formation of audio data, we use the code C8, which is the sequence of the c8,i, defined for the eight quantization levels in a manner similar to the ci described in Section 2.1.2, and the m8,i defined above.

2.1.4. Use of 16 quantization levels for scaling coefficients

(1) Parameter setting

The values of T1m, T1p, T2m, T2p, T3m, T3p, T4m, and T4p, which are the parameters for controlling the authentication precision, are chosen to satisfy the following conditions:

  1) T1m < T1 < T1p < Th(minus) < T2m < T2 < T2p < 0 < T3m < T3 < T3p < Th(plus) < T4m < T4 < T4p.

  2) The value of T1m is chosen so that it divides the scaling coefficients in [Vmin, T1] into two equal-frequency groups; T1p, T2m, ..., T4p are defined similarly.
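The subdivision rule in condition 2), in which each new threshold splits its interval into two equal-frequency halves, can be sketched as a per-interval median; the helper name is an assumption.

```python
def subdivide(coeffs, lo, hi):
    """Equal-frequency split of the coefficients lying in [lo, hi]:
    the returned value plays the role of T1m for [Vmin, T1], etc."""
    inside = sorted(v for v in coeffs if lo <= v <= hi)
    return inside[len(inside) // 2]
```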

(2) Encoding

Sixteen sets of G16,0 to G16,15 are defined as follows:

  • G16,0 = {V | V ∈ VSC, V ≤ T1m},

  • G16,1 = {V | V ∈ VSC, T1m < V < T1},

  • G16,2 = {V | V ∈ VSC, T1 ≤ V ≤ T1p},

  • G16,3 = {V | V ∈ VSC, T1p < V < Th(minus)},

  • G16,4 = {V | V ∈ VSC, Th(minus) ≤ V ≤ T2m},

  • G16,5 = {V | V ∈ VSC, T2m < V < T2},

  • G16,6 = {V | V ∈ VSC, T2 ≤ V ≤ T2p},

  • G16,7 = {V | V ∈ VSC, T2p < V < 0},

  • G16,8 = {V | V ∈ VSC, 0 ≤ V ≤ T3m},

  • G16,9 = {V | V ∈ VSC, T3m < V < T3},

  • G16,10 = {V | V ∈ VSC, T3 ≤ V ≤ T3p},

  • G16,11 = {V | V ∈ VSC, T3p < V < Th(plus)},

  • G16,12 = {V | V ∈ VSC, Th(plus) ≤ V ≤ T4m},

  • G16,13 = {V | V ∈ VSC, T4m < V < T4},

  • G16,14 = {V | V ∈ VSC, T4 ≤ V ≤ T4p},

  • G16,15 = {V | V ∈ VSC, T4p < V}.

As before, the value for each set G16,i is represented by its average, m16,i. For the formation of audio data, we use the code C16, which is the sequence of the c16,i, defined for the 16 quantization levels in a manner similar to the ci described in Section 2.1.2, and the m16,i defined above.

2.2. Audio data formation using code replacement

In this subsection, the formation of sound data is explained using five quantization levels for the scaling coefficients as an example.1 The scaling coefficient sequence for audio data A is expressed as S(A)k = {x1, x2, x3,..., xk}, where k is the total number of scaling coefficients of A at this level. Then, the sequence C(A)k = {X1, X2, X3,..., Xk} is determined, where Xi ∈ {0, 1, 2, 3, 4} is the element index indicating to which of the five sets the scaling coefficient xi of A belongs. Next, the audio data A′ is defined as having the scaling coefficient sequence S(A′)k and a value of zero for all wavelet coefficients at every level. S(A′)k is defined as S(A′)k = {a1, a2, a3,..., ak}, where ai ∈ {m0A, m1A, m2A, m3A, m4A} is the average of the scaling coefficients of A in the range denoted by Xi ∈ {0, 1, 2, 3, 4} and is obtained from A. Then, the audio data B′A is defined as having the scaling coefficient sequence S(B′A)k and a value of zero for all wavelet coefficients at every level. S(B′A)k is defined as S(B′A)k = {bA,1, bA,2, bA,3,..., bA,k}, where bA,i ∈ {m0B, m1B, m2B, m3B, m4B} is the average of the scaling coefficients of B in the range denoted by Xi ∈ {0, 1, 2, 3, 4} obtained from A. S(B′A)k is obtained by replacing Yi with Xi when Yi ≠ Xi, and then replacing bi with bA,i, where bi is the average of the scaling coefficients of B in the range denoted by Yi. Therefore, C(B′A)k = C(A)k. As a result, B′A is expected to be similar to A.
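The code-replacement step can be sketched directly: B′A takes B's representative averages, but indexed by A's codes, so its code sequence matches A's by construction. The function name, toy codes, and averages below are hypothetical.

```python
def code_replacement(codes_a, averages_b):
    """S(B'_A)_k: B's representative average m_j^B selected by A's code X_i,
    so that C(B'_A)_k equals C(A)_k by construction."""
    return [averages_b[x] for x in codes_a]

# Hypothetical toy data: A's code sequence and B's per-set averages m_0^B..m_4^B.
codes_a = [0, 2, 4, 1, 3]
averages_b = {0: -0.8, 1: -0.3, 2: 0.0, 3: 0.3, 4: 0.9}
s_b_prime_a = code_replacement(codes_a, averages_b)
```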

2.3. Data for communication

A sequence D1(B′A)n is defined as D1(B′A)n = {z1, z2,..., zn}, where n is the total number of cases where Yi ≠ Xi, zp = [|yi|] mod 256, and the integer p is increased from 1 to n, in steps of size 1, when Yi ≠ Xi.1 Here, [x] signifies the maximum integer that is not greater than x. Then, a sequence D2(B′A)n is defined as D2(B′A)n = {Z1, Z2,..., Zn}, where n is the total number of cases for which Yi ≠ Xi and Zp = Xi.1

In communications between two users, the message sender and the receiver each have the secret key B, and the sender sends D1(B′A)n and D2(B′A)n to the receiver.1 Then, the receiver composes B″A, which is defined in Section 2.4 and is expected to be similar to A.
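On the sender's side, D1 and D2 record, for each position where B's code disagrees with A's, an 8-bit fingerprint of B's coefficient and A's code. A minimal sketch, with the function and variable names assumed:

```python
import math

def sender_data(codes_a, codes_b, coeffs_b):
    """D1(B'_A)_n and D2(B'_A)_n: for each i with Y_i != X_i, record
    z_p = [|y_i|] mod 256 and Z_p = X_i."""
    d1, d2 = [], []
    for x, y, y_coeff in zip(codes_a, codes_b, coeffs_b):
        if y != x:
            d1.append(math.floor(abs(y_coeff)) % 256)  # z_p from B's coefficient
            d2.append(x)                               # Z_p is A's code
    return d1, d2
```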

2.4. Audio data composition

In this subsection, the processing of sound data formation is also explained using the case of five quantization levels, as an example, for the scaling coefficient.1 The scaling coefficient sequence for audio data B is expressed as S(B)k = {y1, y2, y3,..., yk}, where k is the total number of scaling coefficients of B at this level. Then, a sequence C(B)k = {Y1,Y2,Y3,...,Yk} is determined, where Yi ∈ {0,1,2,3,4} is the element index, which indicates to which of the five sets of scaling coefficients yi of B belongs. S(B′)k is defined as S(B′)k = {b1,b2,b3,...,bk}, where bi{m0B,m1B,m2B,m3B,m4B} is the average of the scaling coefficients of B at the range denoted by Yi ∈ {0,1,2,3,4} and is obtained from B.

A sequence D3(B)k is defined as D3(B)k = {zB,1, zB,2,..., zB,k}, where k is the total number of scaling coefficients of B at this level and zB,q = [|yq|] mod 256. B″A is determined as follows: S(B″A)k is calculated from S(B′)k by replacing bq with mZpB (the average mjB with j = Zp) when zB,q = zp, for p = 1,..., n. Then, the audio data B″A is composed by applying the inverse DWT (IDWT) to the scaling coefficient sequence S(B″A)k with a value of zero for all wavelet coefficients at every level. The receiver composes B″A from D1(B′A)n and D2(B′A)n, which are determined by both A and B and are sent by the sender, and from B, which the receiver has obtained prior to the conversation. B″A is expected to be similar to A.
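The receiver's composition can be sketched as below. How a match zB,q = zp is located is left implicit in the text; the sequential left-to-right scan here is an assumption, as are the function name and the toy values in the test.

```python
import math

def compose_b_double_prime(coeffs_b, codes_b, averages_b, d1, d2):
    """Rebuild S(B''_A)_k from B plus the received (D1, D2): start from
    S(B')_k and replace b_q with m_{Z_p}^B wherever z_B,q = z_p."""
    s = [averages_b[y] for y in codes_b]                 # S(B')_k
    q = 0
    for z_p, Z_p in zip(d1, d2):
        while q < len(s) and math.floor(abs(coeffs_b[q])) % 256 != z_p:
            q += 1                                       # scan for z_B,q == z_p
        if q < len(s):
            s[q] = averages_b[Z_p]                       # b_q -> m_{Z_p}^B
            q += 1
    return s                                             # IDWT would follow
```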

2.5. Data reduction

2.5.1. Processing for D1

Because zp = [|yi|] mod 256, zp is in the range from 0 to 255, and thus it can be expressed using 8 bits. In our computer, an integer is represented by 32 bits. Therefore, four values of zp, each expressed using 8 bits, can be packed into a single value expressed by 32 bits. For D1(B′A)n = {z1, z2,..., zn}, z′j is defined as

z′j = z4j−3 + z4j−2 × 256 + z4j−1 × 256^2 + z4j × 256^3,

where j is a natural number. As a result, we obtain a sequence D1′(B′A)m = {z′1, z′2,..., z′m}, where

m = [n/4] (n mod 4 = 0); m = [n/4] + 1 (n mod 4 ≠ 0).

When n mod 4 ≠ 0, the tail of the sequence is zero-padded: z4m+k−2 = 0 (k = 0,..., |n mod 4 − 3|). Here, [x] is defined as in Section 2.3. When n mod 4 = 0, the total amount of data stored in a computer for D1′ is thus one quarter of that stored for D1. However, the total amount of data sent to a receiver depends on the way in which the data are expressed.
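The byte-packing step, together with the inverse used to recover the zp values, can be sketched as follows; the function names are assumptions, and the zero-padding convention follows the text.

```python
def pack_d1(z):
    """Pack four 8-bit values per 32-bit integer (little-endian, base 256),
    zero-padding when len(z) is not a multiple of 4."""
    z = z + [0] * (-len(z) % 4)
    return [z[i] + z[i + 1] * 256 + z[i + 2] * 256 ** 2 + z[i + 3] * 256 ** 3
            for i in range(0, len(z), 4)]

def unpack_d1(packed, n):
    """Recover the first n 8-bit values from the packed 32-bit words."""
    out = []
    for w in packed:
        for _ in range(4):
            out.append(w % 256)
            w //= 256
    return out[:n]
```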

2.5.2. Processing for D2

(1) Case of five quantization levels

D2(B′A)n = {Z1, Z2,..., Zn} and D2′(B′A)l = {Z′1, Z′2,..., Z′l}, where Z′j = Z13j−12 + Z13j−11 × 5 + Z13j−10 × 5^2 + ∙ ∙ ∙ + Z13j × 5^12, are defined as described in Section 2.5.1.

(2) Case of eight quantization levels

D2(B′A)n = {Z1, Z2,..., Zn} and D2″(B′A)r = {Z″1, Z″2,..., Z″r}, where Z″j = Z10j−9 + Z10j−8 × 8 + Z10j−7 × 8^2 + ∙ ∙ ∙ + Z10j × 8^9, are defined as described in Section 2.5.1.

(3) Case of 16 quantization levels

D2(B′A)n = {Z1, Z2,..., Zn} and D2‴(B′A)s = {Z‴1, Z‴2,..., Z‴s}, where Z‴j = Z8j−7 + Z8j−6 × 16 + Z8j−5 × 16^2 + ∙ ∙ ∙ + Z8j × 16^7, are defined as described in Section 2.5.1.
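All three cases follow one pattern: pack as many base-b code digits into each 32-bit word as will fit (the largest packed value, base^digits − 1, stays below 2^32 for 5^13, 8^10, and 16^8). A generic sketch, with the helper name assumed:

```python
def pack_d2(codes, base, digits):
    """Pack `digits` base-`base` codes per 32-bit word: base 5 with 13
    digits, base 8 with 10 digits, or base 16 with 8 digits."""
    codes = codes + [0] * (-len(codes) % digits)   # zero-pad the tail
    words = []
    for i in range(0, len(codes), digits):
        w = 0
        for d in reversed(range(digits)):          # Horner evaluation:
            w = w * base + codes[i + d]            # sum of codes[i+d] * base^d
        words.append(w)
    return words
```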

3. Numerical Experiment

We applied the proposed method using several voice recordings for A; for B, we used two recordings of music, one classical and the other hip-hop. The music was taken from a copyright-free database.2 In all cases, the produced B″A was audible and sounded similar to A, whether it was made with five, eight, or 16 quantization levels. Increasing the quantization level improved the sound quality, because a waveform made from B″A with a higher quantization level was more similar to the original waveform than one made with a lower quantization level, as shown in Fig. 2. For cases (1), (2), and (3) in Section 2.5.2, the data reduction for one minute of audio data (44.1 kHz, 16 bits, single channel, 87 KB in size) was as follows:

  (1) D1 (75 KB) → D1′ (48 KB), D2 (49 KB) → D2′ (9 KB)

  (2) D1 (86 KB) → D1′ (55 KB), D2 (57 KB) → D2″ (21 KB)

  (3) D1 (92 KB) → D1′ (59 KB), D2 (65 KB) → D2‴ (29 KB)

Fig. 2.

Examples of waveforms.

4. Conclusion

We developed a secure communication method using a discrete wavelet transform for audio data; we used an increased number of quantization levels for the scaling coefficients along with a data reduction technique. The waveform produced by the proposed method was more similar to the original one than that produced by our previously proposed method.1

References

1. Y. Tsuda, K. Nishimura, H. Oyaizu, Y. Yoshitomi, T. Asada, and M. Tabuse, A method for secure communication using a discrete wavelet transform for audio data, J. Robotics, Networking and Artif. Life, Vol. 3, No. 3, 2016, pp. 193-196.
2. M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, RWC music database: database of copyright-cleared musical pieces and instrument sounds for research purposes, Trans. IPSJ, Vol. 45, No. 3, 2004, pp. 728-738.
ISSN (Online)
2352-6386
ISSN (Print)
2405-9021
