A Method for Secure Communication Using a Discrete Wavelet Transform for Audio Data and Improvement of Speaker Authentication
- DOI
- 10.2991/jrnal.2018.5.2.4
- Keywords
- Secure communication; Audio data processing; Wavelet transform; Encoding
- Abstract
We developed a secure communication method using a discrete wavelet transform. Two users must each have a copy of the same piece of music to be able to communicate with each other. The message receiver can produce audio data similar to the sending user's speech by using our previously proposed method and the given recording of music. To improve the accuracy of speaker authentication, the quantization level for the scaling coefficients is increased. Furthermore, the amount of data sent to the message receiver can be substantially reduced by exploiting the characteristics of these data.
- Copyright
- Copyright © 2018, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article under the CC BY-NC license (http://creativecommons.org/licences/by-nc/4.0/).
1. Introduction
The elderly are often targets of telephone fraud: the fraudster pretends, while talking on the phone, to be a grandchild of the elderly person and urges the victim to send money, for example, through a bank transfer. In the present study, we propose a method for secure communication using a discrete wavelet transform (DWT) that improves speaker authentication; it is an enhancement of our previously proposed method.1 The method can be used with Internet protocol (IP) telephones, and it has the potential to help prevent telephone fraud.
2. Proposed Method
2.1. Encoding
2.1.1. Phenomenon exploited for the coding algorithm for audio data
In the course of our research,1 we found that the histogram of the scaling coefficients for each domain of a multiresolution analysis (MRA) sequence is centered at approximately zero when a DWT is performed on audio data. Exploiting this phenomenon, we have developed a secure communication method using audio data.1
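This observation is easy to check in code. The following is a minimal sketch (not the authors' implementation); the file name voice.wav, the db4 wavelet, and the decomposition level are illustrative assumptions.

```python
import numpy as np
import pywt
from scipy.io import wavfile

# Read 16-bit mono audio (hypothetical file name).
rate, samples = wavfile.read("voice.wav")

# Multiresolution analysis: coeffs[0] holds the scaling (approximation)
# coefficients; coeffs[1:] hold the wavelet (detail) coefficients.
coeffs = pywt.wavedec(samples.astype(float), "db4", level=4)
scaling = coeffs[0]

# The histogram of the scaling coefficients peaks near zero.
hist, edges = np.histogram(scaling, bins=50)
peak = (edges[np.argmax(hist)] + edges[np.argmax(hist) + 1]) / 2
print(f"histogram peak at {peak:.1f}")
```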
2.1.2. Use of five quantization levels for scaling coefficients
(1) Parameter setting
In our previously reported study,1 we set the coding parameters as follows.
The values of Th(minus) and Th(plus) in Fig. 1 are chosen such that the nonpositive scaling coefficients (Sm in total) are divided into two equal-frequency groups by Th(minus), and the positive scaling coefficients (Sp in total) are divided into two equal-frequency groups by Th(plus). Next, the values of T1, T2, T3, and T4, which are the parameters for controlling the authentication precision, are chosen to satisfy the following conditions:
- 1)
T1 < Th(minus) < T2 < 0 < T3 < Th(plus) < T4
- 2)
The value of ST1, which is the number of scaling coefficients in (T1, Th(minus)), is equal to ST2, which is the number of scaling coefficients in [Th(minus),T2), i.e., ST1 = ST2.
- 3)
The value of ST3, the number of scaling coefficients in (T3, Th(plus)], is equal to ST4, the number of scaling coefficients in (Th(plus),T4), i.e., ST3 = ST4.
- 4)
ST1/Sm = ST3/Sp.
In the present study, the values of both ST1/Sm and ST3/Sp are set to 0.3, which was determined experimentally.
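Because conditions 2)–4) fix the fraction of coefficients between each threshold and the neighboring median, the thresholds can be read off as quantiles. The following sketch is an assumed implementation, not taken from the paper:

```python
import numpy as np

def five_level_thresholds(scaling, ratio=0.3):
    """Choose T1..T4 and Th(minus)/Th(plus) as quantiles (ratio = ST1/Sm = ST3/Sp)."""
    neg = scaling[scaling <= 0]             # Sm nonpositive coefficients
    pos = scaling[scaling > 0]              # Sp positive coefficients
    th_minus = np.quantile(neg, 0.5)        # halves the nonpositive group
    th_plus = np.quantile(pos, 0.5)         # halves the positive group
    t1 = np.quantile(neg, 0.5 - ratio)      # ST1 coefficients in (T1, Th(minus))
    t2 = np.quantile(neg, 0.5 + ratio)      # ST2 coefficients in [Th(minus), T2)
    t3 = np.quantile(pos, 0.5 - ratio)      # ST3 coefficients in (T3, Th(plus)]
    t4 = np.quantile(pos, 0.5 + ratio)      # ST4 coefficients in (Th(plus), T4)
    return t1, t2, t3, t4, th_minus, th_plus
```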
(2) Encoding
In the preprocessing of the audio data prior to encoding, the scaling coefficients V of the MRA sequence are separated into five sets (G0 to G4), as shown in Fig. 1, under the following criteria:
- •
G0 = {V | V ∈ VSC, V ≤ T1},
- •
G1 = {V | V ∈ VSC, T1 < V < T2},
- •
G2 = {V | V ∈ VSC, T2 ≤ V ≤ T3},
- •
G3 = {V | V ∈ VSC, T3 < V < T4},
- •
G4 = {V | V ∈ VSC, T4 ≤ V}.
The scaling coefficients for the MRA sequence are encoded according to the following rules, where Vi denotes scaling coefficient i: when Vi ∈ G0, ci = 0; when Vi ∈ G1, ci = 1; when Vi ∈ G2, ci = 2; when Vi ∈ G3, ci = 3; and when Vi ∈ G4, ci = 4. We represent the scaling coefficients of each set Gj by their average value, mj. For the formation of audio data, we use a code C, which is the sequence of ci and mj defined above.
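A minimal sketch of this encoding step follows (an assumed implementation; the boundary handling of np.digitize differs slightly from the mixed open/closed intervals above). Given the longer threshold lists, the same function also covers the eight- and 16-level cases of Sections 2.1.3 and 2.1.4.

```python
import numpy as np

def encode(scaling, thresholds):
    """Map each scaling coefficient to its set index c_i and compute the
    representative (average) value m_j of each set."""
    codes = np.digitize(scaling, thresholds)   # c_i in {0, ..., len(thresholds)}
    reps = np.array([scaling[codes == j].mean() if np.any(codes == j) else 0.0
                     for j in range(len(thresholds) + 1)])  # m_j
    return codes, reps
```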
2.1.3. Use of eight quantization levels for scaling coefficients
Here, we define eight sets, G8,0 to G8,7, as follows:
- •
G8,0 = {V | V ∈ VSC, V ≤ T1},
- •
G8,1 = {V | V ∈ VSC, T1 < V < Th(minus)},
- •
G8,2 = {V | V ∈ VSC, Th(minus) ≤ V ≤ T2},
- •
G8,3 = {V | V ∈ VSC, T2 < V < 0},
- •
G8,4 = {V | V ∈ VSC, 0 ≤ V ≤ T3},
- •
G8,5 = {V | V ∈ VSC, T3 < V < Th(plus)},
- •
G8,6 = {V | V ∈ VSC, Th(plus) ≤ V ≤ T4},
- •
G8,7 = {V | V ∈ VSC, T4 < V}.
Again, we let the representative value of each set G8,i be its average, m8,i. For the formation of audio data, we use the code C8, which is the sequence of c8,i, defined for the eight quantization levels in the same manner as ci in Section 2.1.2, together with the m8,i defined above.
2.1.4. Use of 16 quantization levels for scaling coefficients
(1) Parameter setting
The values of T1m, T1p, T2m, T2p, T3m, T3p, T4m, and T4p, which are the parameters for controlling the authentication precision, are chosen to satisfy the following conditions:
- 1)
T1m < T1 < T1p < Th(minus) < T2m < T2 < T2p < 0 < T3m < T3 < T3p < Th(plus) < T4m < T4 < T4p
- 2)
The value of T1m is chosen so that it divides the scaling coefficients in [Vmin, T1] into two equal-frequency groups, where Vmin denotes the minimum scaling coefficient. The values of T1p, T2m,..., T4p are defined similarly to T1m.
(2) Encoding
Sixteen sets of G16,0 to G16,15 are defined as follows:
- •
G16,0 = {V | V ∈ VSC, V ≤ T1m},
- •
G16,1 = {V | V ∈ VSC, T1m < V < T1},
- •
G16,2 = {V | V ∈ VSC, T1 ≤ V ≤ T1p},
- •
G16,3 = {V | V ∈ VSC, T1p < V < Th(minus)},
- •
G16,4 = {V | V ∈ VSC, Th(minus) ≤ V ≤ T2m},
- •
G16,5 = {V | V ∈ VSC, T2m < V < T2},
- •
G16,6 = {V | V ∈ VSC, T2 ≤ V ≤ T2p},
- •
G16,7 = {V | V ∈ VSC, T2p < V < 0},
- •
G16,8 = {V | V ∈ VSC, 0 ≤ V ≤ T3m},
- •
G16,9 = {V | V ∈ VSC, T3m < V < T3},
- •
G16,10 = {V | V ∈ VSC, T3 ≤ V ≤ T3p},
- •
G16,11 = {V | V ∈ VSC, T3p < V < Th(plus)},
- •
G16,12 = {V | V ∈ VSC, Th(plus) ≤ V ≤ T4m},
- •
G16,13 = {V | V ∈ VSC, T4m < V < T4},
- •
G16,14 = {V | V ∈ VSC, T4 ≤ V ≤ T4p},
- •
G16,15 = {V | V ∈ VSC, T4p < V}.
As before, the value of each set G16,i is represented by its average value, m16,i. For the formation of audio data, we use the code C16, which is the sequence of c16,i, defined for the 16 quantization levels in the same manner as ci in Section 2.1.2, together with the m16,i defined above.
2.2. Audio data formation using code replacement
In this subsection, the formation of sound data is explained; for this example, we use five quantization levels for the scaling coefficients.1 The scaling coefficient sequence for audio data A is expressed as S(A)k = {x1, x2, x3,..., xk}, where k is the total number of scaling coefficients of A at this level. Then, the sequence C(A)k = {X1, X2, X3,..., Xk} is determined, where Xi ∈ {0,1,2,3,4} is the element index, which indicates to which of the five sets the scaling coefficient xi of A belongs. Next, the audio data A′ is defined as having the scaling coefficient sequence S(A′)k and a value of zero for all wavelet coefficient values at every level. S(A′)k is defined as S(A′)k = {a1, a2, a3,..., ak}, where ai = mXi, i.e., each scaling coefficient of A is replaced by the representative value of the set to which it belongs.
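The following sketch puts the pieces together (an assumed implementation reusing the encode() sketch of Section 2.1.2; the db4 wavelet and the decomposition level are illustrative choices):

```python
import numpy as np
import pywt

def form_audio(samples, thresholds, wavelet="db4", level=4):
    """Form A': replace each scaling coefficient by its set's representative
    value (a_i = m_{X_i}), zero all wavelet coefficients, then invert."""
    coeffs = pywt.wavedec(samples.astype(float), wavelet, level=level)
    codes, reps = encode(coeffs[0], thresholds)   # X_i and m_j
    coeffs[0] = reps[codes]                       # a_i = m_{X_i}
    for d in range(1, len(coeffs)):
        coeffs[d] = np.zeros_like(coeffs[d])      # zero every wavelet level
    return pywt.waverec(coeffs, wavelet)
```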
2.3. Data for communication
A sequence D1(B′A)n is defined as D1(B′A)n = {z1, z2,..., zn}, where n is the total number of positions at which Yi ≠ Xi (the codes Yi of the key data B are defined in Section 2.4), zp = [|yi|] mod 256, and the integer p is increased from 1 to n, in steps of size 1, at each position where Yi ≠ Xi.1 Here, [x] signifies the maximum integer that is not greater than x. Then, a sequence D2(B′A)n is defined as D2(B′A)n = {Z1, Z2,..., Zn}, where n is the total number of cases for which Yi ≠ Xi and Zp = Xi.1
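A sketch of this step (an assumed implementation; x_codes and y_codes correspond to C(A)k and C(B)k, and y_scaling to S(B)k):

```python
import numpy as np

def communication_data(x_codes, y_codes, y_scaling):
    """Build D1 and D2: wherever Y_i != X_i, D1 records [|y_i|] mod 256 and
    D2 records the correct code X_i."""
    mask = y_codes != x_codes
    d1 = np.floor(np.abs(y_scaling[mask])).astype(np.int64) % 256  # z_p
    d2 = x_codes[mask]                                             # Z_p = X_i
    return d1, d2
```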
In communications between two users, the message sender and the receiver each have the secret key B, and the sender sends D1(B′A)n and D2(B′A)n to the receiver.1 Then, the receiver composes B″A, which is defined in Section 2.4 and is expected to be similar to A.
2.4. Audio data composition
In this subsection, the composition of sound data is likewise explained using the case of five quantization levels for the scaling coefficients as an example.1 The scaling coefficient sequence for audio data B is expressed as S(B)k = {y1, y2, y3,..., yk}, where k is the total number of scaling coefficients of B at this level. Then, a sequence C(B)k = {Y1, Y2, Y3,..., Yk} is determined, where Yi ∈ {0,1,2,3,4} is the element index, which indicates to which of the five sets the scaling coefficient yi of B belongs. S(B′)k is defined as S(B′)k = {b1, b2, b3,..., bk}, where bi = mYi, i.e., each scaling coefficient of B is replaced by the representative value of the set to which it belongs.
A sequence D3(B)k is defined as D3(B)k = {zB,1, zB,2,..., zB,k}, where k is the total number of scaling coefficients of B at this level, and zB,q = [|yq|] mod 256. B″A is determined as follows: S(B″A)k is calculated from S(B′)k by replacing bq with mZp at each position q where zB,q matches zp, with p advancing in order from 1 to n.
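A sketch of the receiver side under this reading (an assumed implementation; the sequential matching of D3 against D1 is our reconstruction of the rule above):

```python
import numpy as np

def compose(b_prime, y_scaling, d1, d2, reps):
    """Form S(B''_A): scan B's coefficients in order and, whenever
    [|y_q|] mod 256 equals the next element of D1, replace b_q with the
    representative value m_{Z_p} given by D2."""
    out = np.array(b_prime, dtype=float)
    p = 0
    for q in range(len(y_scaling)):
        if p < len(d1) and int(abs(y_scaling[q])) % 256 == d1[p]:
            out[q] = reps[d2[p]]      # b_q -> m_{Z_p}
            p += 1
    return out
```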
2.5. Data reduction
2.5.1. Processing for D1
Because zp = [|yi|] mod 256, zp is in the range from 0 to 255, and thus it can be expressed using 8 bits. In our computer, an integer is represented by 32 bits. Therefore, four values for zp, each expressed using 8 bits, can be integrated into a single value expressed by 32 bits. For D1(B′A)n = {z1, z2,..., zn}, z′j is defined as z′j = z4j–3 + z4j–2 × 256 + z4j–1 × 256² + z4j × 256³, which yields the reduced sequence D1′(B′A).
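A sketch of this packing and the corresponding unpacking (an assumed implementation; the trailing group is zero-padded when n is not a multiple of four):

```python
import numpy as np

def pack_d1(d1):
    """Pack four 8-bit values into one 32-bit integer:
    z'_j = z_{4j-3} + z_{4j-2}*256 + z_{4j-1}*256**2 + z_{4j}*256**3."""
    padded = np.pad(d1, (0, (-len(d1)) % 4)).astype(np.int64)
    return padded.reshape(-1, 4) @ (256 ** np.arange(4))

def unpack_d1(packed, n):
    """Recover the first n 8-bit values from the packed sequence."""
    digits = (np.asarray(packed)[:, None] // (256 ** np.arange(4))) % 256
    return digits.reshape(-1)[:n]
```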
2.5.2. Processing for D2
(1) Case of five quantization levels
The sequences D2(B′A)n = {Z1, Z2,..., Zn} and D2′(B′A)l = {Z′1, Z′2,..., Z′l}, where Z′j = Z13j–12 + Z13j–11 × 5 + Z13j–10 × 5² + ∙ ∙ ∙ + Z13j × 5¹², are defined as described in Section 2.5.1; thirteen base-5 values fit into a single 32-bit integer because 5¹³ < 2³¹.
(2) Case of eight quantization levels
The sequences D2(B′A)n = {Z1, Z2,..., Zn} and D2″(B′A)r = {Z″1, Z″2,..., Z″r}, where Z″j = Z10j–9 + Z10j–8 × 8 + Z10j–7 × 8² + ∙ ∙ ∙ + Z10j × 8⁹, are defined as described in Section 2.5.1; ten base-8 values fit into a single 32-bit integer.
(3) Case of 16 quantization levels
The sequences D2(B′A)n = {Z1, Z2,..., Zn} and D2‴(B′A)s = {Z‴1, Z‴2,..., Z‴s}, where Z‴j = Z8j–7 + Z8j–6 × 16 + Z8j–5 × 16² + ∙ ∙ ∙ + Z8j × 16⁷, are defined as described in Section 2.5.1; eight base-16 values fit into a single 32-bit word.
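All three cases follow the same base-q digit packing; a generic sketch (an assumed implementation) is:

```python
import numpy as np

def pack_d2(d2, q, digits):
    """Pack `digits` consecutive base-q codes into one integer; q=5 with 13
    digits, q=8 with 10, and q=16 with 8 correspond to cases (1)-(3)."""
    padded = np.pad(d2, (0, (-len(d2)) % digits)).astype(np.int64)
    return padded.reshape(-1, digits) @ (q ** np.arange(digits, dtype=np.int64))

# e.g. pack_d2(d2, 5, 13) realizes case (1): Z'_j = Z_{13j-12} + ... + Z_{13j} * 5**12
```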
3. Numerical Experiment
We applied the proposed method using several voice recordings for A; for B, we used two recordings of music, one classical and the other hip-hop, taken from a copyright-free database.2 In all cases, every produced B″A was audible and sounded similar to A; each B″A was made with five, eight, or 16 quantization levels. Increasing the quantization level improved the sound quality: a waveform made from B″A with a higher quantization level was more similar to the original waveform than one made with a lower level, as shown in Fig. 2. For cases (1), (2), and (3) in Section 2.5.2, the data reduction for one minute of audio data (44.1 kHz, 16 bits, single channel; data volume of 87 KB) was as follows:
- (1)
D1(75 KB) → D1′(48 KB), D2(49 KB) → D2′(9 KB)
- (2)
D1(86 KB) → D1′(55 KB), D2(57 KB) → D2″(21 KB)
- (3)
D1(92 KB) → D1′(59 KB), D2(65 KB) → D2‴(29 KB)
4. Conclusion
We developed a secure communication method using a discrete wavelet transform for audio data; we used an increased number of quantization levels for the scaling coefficients along with a data reduction technique. The waveform produced by the proposed method was more similar to the original one than that produced by our previously proposed method.1
References