{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Using TensorFlow backend.\n"
]
},
{
"data": {
"text/html": [
"\n",
" show code\n",
" "
],
"text/plain": [
""
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"base_dir = 'D:\\\\deep_learning\\\\text'\n",
"%run ../initscript.py\n",
"# %run ../display.py\n",
"import pandas as pd\n",
"import numpy as np\n",
"import scipy.stats as st\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from ipywidgets import *\n",
"%matplotlib inline\n",
"import tensorflow as tf\n",
"tf.logging.set_verbosity(tf.logging.ERROR)\n",
"\n",
"import os\n",
"import random\n",
"import sys\n",
"\n",
"from keras import optimizers\n",
"from keras import backend as K\n",
"from keras import models\n",
"from keras import layers\n",
"from keras import initializers\n",
"from keras import preprocessing\n",
"from keras.utils import to_categorical, get_file\n",
"\n",
"toggle()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Text Mining"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Deep learning models don't take raw text as input, they only work with numeric tersors. Vectorizing text is the process of transforming text into numeric tensors."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We break down text into different units. In particular, we convert a sentence into sequence of tokens, i.e., words or bag-of-words. $n$-grams is a group of $n$ or fewer consecutive words. Bag-of-words (or bag-of-$n$-grams) is a set of words (or grams) which are not necessary consecutive.\n",
"\n",
"To associate a vector with a token, one approach is one-hot encoding of tokens. One-hot encoding is the most basic way to turn a token into a vector which was applied to the IMDB and Reuters examples. It associates with a binary vector of size $N$, the size of the vocabulary, which is all-zeros except 1 for the i-th entry."
]
},
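{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two ideas above can be sketched in a few lines. This is a minimal illustration, not the code used for the IMDB and Reuters examples; the helper names `ngrams` and `one_hot` and the toy sentence are assumptions made here.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def ngrams(tokens, n):\n",
"    # All groups of 1 to n consecutive tokens (n or fewer consecutive words).\n",
"    return [tuple(tokens[i:i + k])\n",
"            for k in range(1, n + 1)\n",
"            for i in range(len(tokens) - k + 1)]\n",
"\n",
"def one_hot(tokens, vocabulary):\n",
"    # One row per token: all zeros except a 1 at the token's vocabulary index.\n",
"    index = {word: i for i, word in enumerate(vocabulary)}\n",
"    encoded = np.zeros((len(tokens), len(vocabulary)), dtype=int)\n",
"    for row, word in enumerate(tokens):\n",
"        encoded[row, index[word]] = 1\n",
"    return encoded\n",
"\n",
"tokens = 'the cat sat on the mat'.split()\n",
"vocabulary = sorted(set(tokens))        # 5 distinct words\n",
"grams = ngrams(tokens, 2)               # 6 unigrams followed by 5 bigrams\n",
"vectors = one_hot(tokens, vocabulary)   # shape (6, 5)\n",
"```\n",
"\n",
"Each row of `vectors` has exactly one nonzero entry, which is why one-hot vectors become very sparse as the vocabulary grows."
]
},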
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Word Embeddings"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Word embedding is an approach to provide a dense vector representation of words (e.g. the cat is cute may be represented as [4,100,1,233]) that capture something about their meaning. The geometric relationships between word vectors should reflect the semantic relationships between these words. For example, 4 words are embedded on a 2-dimensional plane:\n",
"\n",
"With the vector representations we chose here, some semantic relationships between these words can be encoded as geometric transformations. For instance, the same vector allows us to go from cat to tiger and from dog to wolf : this vector could be interpreted as the \"from pet to wild animal\" vector. Similarly, another vector lets us go from dog to cat and from wolf to tiger, which could be interpreted as a \"from canine to feline\" vector.\n",
"\n",
"There are two ways to obtain word embeddings:\n",
"\n",
"- Learn word embeddings jointly with the main task such as document classification or sentiment prediction. In this setup, we would start with random word vectors, then learn word vectors as the weights of a neural network by using Embedding layer.\n",
"\n",
"- Load pre-computed word embeddings package which is obtained from a different machine learning task"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Learning Word Embedding"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Consider the IMDB movie review sentiment prediction. We\n",
"\n",
"- restrict the movie reviews to the top 10,000 most common words as we did before, and \n",
"\n",
"- cut the reviews after only 20 words. \n",
"\n",
"Our network will simply \n",
"\n",
"- learn 8-dimensional embeddings for each of the 10,000 words\n",
"\n",
"- turn the input integer sequences (2D integer tensor with shape (25000, 20) or `(samples, sequence_length)`) into embedded sequences (3D float tensor with shape (25000, 20, 8) or `(samples, sequence_length, embedding_dimensionality)`), \n",
"\n",
"- flatten the tensor to 2D, and \n",
"\n",
"- train a single Dense layer on top for classification."
]
},
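{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the shapes above concrete, here is a NumPy sketch of what an embedding layer computes, namely a table lookup. It is an illustration under assumptions, not the Keras implementation: the random table stands in for learned weights, and the batch of 4 samples is arbitrary.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"vocab_size, embedding_dim, sequence_length = 10000, 8, 20\n",
"\n",
"# One row of weights per word index; random here, learned in a real model.\n",
"embedding_table = rng.normal(size=(vocab_size, embedding_dim))\n",
"\n",
"# A batch of 4 integer sequences, shape (samples, sequence_length).\n",
"batch = rng.integers(0, vocab_size, size=(4, sequence_length))\n",
"\n",
"embedded = embedding_table[batch]              # (4, 20, 8)\n",
"flattened = embedded.reshape(len(batch), -1)   # (4, 160)\n",
"```\n",
"\n",
"The table holds 10000 * 8 = 80000 weights, and each flattened sample has 20 * 8 = 160 features."
]
},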
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"_________________________________________________________________\n",
"Layer (type) Output Shape Param # \n",
"=================================================================\n",
"embedding_1 (Embedding) (None, 20, 8) 80000 \n",
"_________________________________________________________________\n",
"flatten_1 (Flatten) (None, 160) 0 \n",
"_________________________________________________________________\n",
"dense_1 (Dense) (None, 1) 161 \n",
"=================================================================\n",
"Total params: 80,161\n",
"Trainable params: 80,161\n",
"Non-trainable params: 0\n",
"_________________________________________________________________\n"
]
}
],
"source": [
"from keras.datasets import imdb\n",
"\n",
"max_features = 10000\n",
"maxlen = 20\n",
"np_load_old = np.load\n",
"np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)\n",
"(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)\n",
"np.load = np_load_old\n",
"x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)\n",
"x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)\n",
"model = models.Sequential()\n",
"model.add(layers.Embedding(10000, 8, input_length=maxlen))\n",
"model.add(layers.Flatten())\n",
"model.add(layers.Dense(1, activation='sigmoid'))\n",
"model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])\n",
"model.summary()\n",
"#history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.2, verbose=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We get to a validation accuracy of ~76%, which is pretty good considering that we only look at the first 20 words in every review. But note that merely flattening the embedded sequences and training a single Dense layer on top leads to a model that treats each word in the input sequence separately, without considering inter-word relationships and structure sentence.\n",
"\n",
"It would be much better to add recurrent layers or 1D convolutional layers (recognizing local patterns in a sentence or word sequence) on top of the embedded sequences to learn features that take into account each sequence as a whole."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pretrained Word Embeddings"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instead of using the pre-tokenized IMDB data packaged in Keras, we start from scratch by using the original text data."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"imdb_dir = base_dir+'\\\\aclImdb'\n",
"train_dir = os.path.join(imdb_dir, 'train')\n",
"\n",
"labels = []\n",
"texts = []\n",
"\n",
"for label_type in ['neg', 'pos']:\n",
" dir_name = os.path.join(train_dir, label_type)\n",
" for fname in os.listdir(dir_name):\n",
" if fname[-4:] == '.txt':\n",
" f = open(os.path.join(dir_name, fname), encoding=\"utf8\")\n",
" texts.append(f.read())\n",
" f.close()\n",
" if label_type == 'neg':\n",
" labels.append(0)\n",
" else:\n",
" labels.append(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are 25000 texts and labels.\n",
"\n",
"We restrict the training data to its first 200 samples. So we will be learning to classify movie reviews after looking at just 200 examples. The validation samples is 10,000.\n",
"\n",
"The texts are vectorized into sequences. The length of `sequences` is 25000, and `sequences[i]` is a list of integers corresponding to `texts[i]`.\n",
"\n",
"The `word_index` is a dictionary with words as keys and integer numbers as values, which is a mapping from words to integers."
]
},
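{
"cell_type": "markdown",
"metadata": {},
"source": [
"The idea behind `word_index` and `sequences` can be sketched in pure Python. This is a rough illustration that assumes whitespace tokenization and three made-up texts; the Keras `Tokenizer` additionally strips punctuation and applies the `num_words` limit.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"texts = ['the movie was great', 'the movie was bad', 'great cast']\n",
"counts = Counter(word for text in texts for word in text.lower().split())\n",
"\n",
"# The most frequent word gets index 1; index 0 is left free for padding.\n",
"word_index = {word: rank\n",
"              for rank, (word, _) in enumerate(counts.most_common(), start=1)}\n",
"\n",
"# Replace every word by its index.\n",
"sequences = [[word_index[word] for word in text.lower().split()]\n",
"             for text in texts]\n",
"```\n",
"\n",
"Frequent words receive small indices, so keeping only the indices below a threshold keeps the most common words."
]
},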
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Found 88582 unique tokens.\n",
"Shape of data tensor: (25000, 100)\n",
"Shape of label tensor: (25000,)\n"
]
}
],
"source": [
"maxlen = 100 # We will cut reviews after 100 words\n",
"training_samples = 200\n",
"validation_samples = 10000\n",
"max_words = 10000 # We will only consider the top 10,000 words in the dataset\n",
"\n",
"tokenizer = preprocessing.text.Tokenizer(num_words=max_words)\n",
"tokenizer.fit_on_texts(texts)\n",
"sequences = tokenizer.texts_to_sequences(texts)\n",
"\n",
"word_index = tokenizer.word_index\n",
"print('Found %s unique tokens.' % len(word_index))\n",
"\n",
"data = preprocessing.sequence.pad_sequences(sequences, maxlen=maxlen)\n",
"\n",
"labels = np.asarray(labels)\n",
"print('Shape of data tensor:', data.shape)\n",
"print('Shape of label tensor:', labels.shape)\n",
"\n",
"# Split the data into a training set and a validation set\n",
"# But first, shuffle the data, since we started from data\n",
"# where sample are ordered (all negative first, then all positive).\n",
"indices = np.arange(data.shape[0])\n",
"np.random.shuffle(indices)\n",
"data = data[indices]\n",
"labels = labels[indices]\n",
"\n",
"x_train = data[:training_samples]\n",
"y_train = labels[:training_samples]\n",
"x_val = data[training_samples: training_samples + validation_samples]\n",
"y_val = labels[training_samples: training_samples + validation_samples]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use `glove.6B` the pre-computed embeddings from 2014 English Wikipedia."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Found 400000 word vectors.\n"
]
}
],
"source": [
"glove_dir = base_dir+'\\\\glove.6B'\n",
"embeddings_index = {}\n",
"f = open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding=\"utf8\")\n",
"for line in f:\n",
" values = line.split()\n",
" word = values[0]\n",
" coefs = np.asarray(values[1:], dtype='float32')\n",
" embeddings_index[word] = coefs\n",
"f.close()\n",
"\n",
"print('Found %s word vectors.' % len(embeddings_index))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`glove.6B.100d.txt` has 400,000 rows where each row includes a word and a float array. For example, the first row likes\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We need an embedding matrix to set `Embedding` layer's weight as the pretrained word embeddings. The embedding matrix must have shape `(max_words, embedding_dim)` ((10000, 100)), where each entry $i$ contains the embedding_dim, a dimensional vector for the word of index $i$. Note that `embedding_matrix[0]` needs to be a 0 array as a placeholder.\n",
"\n",
"We load the GloVe matrix (`embedding_matrix`) into Embedding layer and freeze the embedding layer. We can also try to train the same model without loading the pre-trained word embeddings and without freezing the embedding layer. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"_________________________________________________________________\n",
"Layer (type) Output Shape Param # \n",
"=================================================================\n",
"embedding_2 (Embedding) (None, 100, 100) 1000000 \n",
"_________________________________________________________________\n",
"flatten_2 (Flatten) (None, 10000) 0 \n",
"_________________________________________________________________\n",
"dense_2 (Dense) (None, 32) 320032 \n",
"_________________________________________________________________\n",
"dense_3 (Dense) (None, 1) 33 \n",
"=================================================================\n",
"Total params: 1,320,065\n",
"Trainable params: 1,320,065\n",
"Non-trainable params: 0\n",
"_________________________________________________________________\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
" show code\n",
" "
],
"text/plain": [
""
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"embedding_dim = 100\n",
"\n",
"embedding_matrix = np.zeros((max_words, embedding_dim))\n",
"for word, i in word_index.items():\n",
" embedding_vector = embeddings_index.get(word)\n",
" if i < max_words:\n",
" if embedding_vector is not None:\n",
" # Words not found in embedding index will be all-zeros.\n",
" embedding_matrix[i] = embedding_vector\n",
" \n",
"model = models.Sequential()\n",
"model.add(layers.Embedding(max_words, embedding_dim, input_length=maxlen))\n",
"model.add(layers.Flatten())\n",
"model.add(layers.Dense(32, activation='relu'))\n",
"model.add(layers.Dense(1, activation='sigmoid'))\n",
"model.summary()\n",
"\n",
"model.layers[0].set_weights([embedding_matrix])\n",
"model.layers[0].trainable = False\n",
"\n",
"model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])\n",
"history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_data=(x_val, y_val), verbose=0)\n",
"\n",
"acc = history.history['acc']\n",
"val_acc = history.history['val_acc']\n",
"loss = history.history['loss']\n",
"val_loss = history.history['val_loss']\n",
"epochs = range(1, len(acc) + 1)\n",
"\n",
"plt.figure(figsize=(12, 5))\n",
"plt.subplot(1, 2, 1)\n",
"plt.plot(epochs, acc, 'bo', label='Training')\n",
"plt.plot(epochs, val_acc, 'r', label='Validation')\n",
"plt.title('Training and validation accuracy')\n",
"\n",
"plt.subplot(1, 2, 2)\n",
"plt.plot(epochs, loss, 'bo', label='Training')\n",
"plt.plot(epochs, val_loss, 'r', label='Validation')\n",
"plt.title('Training and validation loss')\n",
"plt.legend(bbox_to_anchor=(1.02, 0.2), loc=2, borderaxespad=0.5)\n",
"plt.show()\n",
"toggle()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model quickly starts overfitting and validation accuracy stalls around 50% with only 200 samples. If you increase the number of training samples, this will quickly stop being the case."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"25000/25000 [==============================] - 1s 55us/step\n"
]
},
{
"data": {
"text/plain": [
"[0.7815723500633239, 0.5814]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_dir = os.path.join(imdb_dir, 'test')\n",
"\n",
"labels = []\n",
"texts = []\n",
"\n",
"for label_type in ['neg', 'pos']:\n",
" dir_name = os.path.join(test_dir, label_type)\n",
" for fname in sorted(os.listdir(dir_name)):\n",
" if fname[-4:] == '.txt':\n",
" f = open(os.path.join(dir_name, fname), encoding=\"utf8\")\n",
" texts.append(f.read())\n",
" f.close()\n",
" if label_type == 'neg':\n",
" labels.append(0)\n",
" else:\n",
" labels.append(1)\n",
"\n",
"sequences = tokenizer.texts_to_sequences(texts)\n",
"x_test = preprocessing.sequence.pad_sequences(sequences, maxlen=maxlen)\n",
"y_test = np.asarray(labels)\n",
"model.evaluate(x_test, y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We tokenize the test data and evaluate the model on the test data. The test accuracy is around 50%."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Recurrent Neural Networks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Simple RNN"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Feedforward networks has no memory. However, a recurrent neural network (RNN) processes sequences by iterating through the sequence elements and maintaining a state containing information relative to what it has seen so far. In effect, an RNN is a type of neural network that has an internal loop.\n",
"\n",
"\n",
"\n",
"The hidden state acts as the neural networks memory. It holds information on previous data the network has seen before.\n",
"\n",
"\n",
"\n",
"The hidden state is calculated as follows\n",
"\n",
"\n",
"**Pseudocode RNN**\n",
"```python\n",
"state_t = 0\n",
"for input_t in input_sequence:\n",
" output_t = activation(dot(W, input_t) + dot(U, state_t) + b)\n",
" state_t = output_t\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The final output is a 2D tensor of shape `(timesteps, output_features)`, where each timestep is the output of the loop at time t. Each timestep t in the output tensor contains information about timesteps 0 to t in the input sequence—about the entire past. For this reason, in many cases, we don't need this full sequence of outputs; we just need the last output (output_t at the end of the loop), because it already contains information about the entire sequence."
]
},
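{
"cell_type": "markdown",
"metadata": {},
"source": [
"The RNN pseudocode can be turned into a runnable NumPy sketch. The sizes, the random weights, and the choice of `tanh` as the activation are assumptions made for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"timesteps, input_features, output_features = 100, 32, 64\n",
"\n",
"inputs = rng.normal(size=(timesteps, input_features))\n",
"W = rng.normal(size=(output_features, input_features))\n",
"U = rng.normal(size=(output_features, output_features))\n",
"b = rng.normal(size=(output_features,))\n",
"\n",
"state_t = np.zeros((output_features,))   # initial state: all zeros\n",
"successive_outputs = []\n",
"for input_t in inputs:\n",
"    # Combine the current input with the previous state.\n",
"    output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)\n",
"    successive_outputs.append(output_t)\n",
"    state_t = output_t                   # the output becomes the next state\n",
"\n",
"full_sequence = np.stack(successive_outputs)   # (timesteps, output_features)\n",
"last_output = full_sequence[-1]                # (output_features,)\n",
"```\n",
"\n",
"`full_sequence` is the 2D tensor of all per-timestep outputs, while `last_output` is its final row, the summary of the whole sequence."
]
},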
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"_________________________________________________________________\n",
"Layer (type) Output Shape Param # \n",
"=================================================================\n",
"embedding_3 (Embedding) (None, None, 32) 320000 \n",
"_________________________________________________________________\n",
"simple_rnn_1 (SimpleRNN) (None, 32) 2080 \n",
"=================================================================\n",
"Total params: 322,080\n",
"Trainable params: 322,080\n",
"Non-trainable params: 0\n",
"_________________________________________________________________\n"
]
}
],
"source": [
"model = models.Sequential()\n",
"model.add(layers.Embedding(10000, 32))\n",
"model.add(layers.SimpleRNN(32))\n",
"model.summary()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"_________________________________________________________________\n",
"Layer (type) Output Shape Param # \n",
"=================================================================\n",
"embedding_4 (Embedding) (None, None, 32) 320000 \n",
"_________________________________________________________________\n",
"simple_rnn_2 (SimpleRNN) (None, None, 32) 2080 \n",
"=================================================================\n",
"Total params: 322,080\n",
"Trainable params: 322,080\n",
"Non-trainable params: 0\n",
"_________________________________________________________________\n"
]
}
],
"source": [
"model = models.Sequential()\n",
"model.add(layers.Embedding(10000, 32))\n",
"model.add(layers.SimpleRNN(32, return_sequences=True))\n",
"model.summary()"
]
},
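{
"cell_type": "markdown",
"metadata": {},
"source": [
"The parameter counts in the summaries above can be checked by hand. This sketch assumes the usual layouts: an embedding table with one vector per word, a dense layer with one bias per unit, and a simple RNN with an input matrix, a recurrent matrix, and a bias vector. The function names are made up for this example.\n",
"\n",
"```python\n",
"def embedding_params(vocab_size, dim):\n",
"    return vocab_size * dim\n",
"\n",
"def dense_params(inputs, units):\n",
"    return inputs * units + units\n",
"\n",
"def simple_rnn_params(inputs, units):\n",
"    # input weights + recurrent weights + biases\n",
"    return inputs * units + units * units + units\n",
"\n",
"print(embedding_params(10000, 8))    # 80000\n",
"print(dense_params(20 * 8, 1))       # 161\n",
"print(embedding_params(10000, 32))   # 320000\n",
"print(simple_rnn_params(32, 32))     # 2080\n",
"```\n",
"\n",
"Note that `return_sequences=True` changes the output shape of `SimpleRNN` but not its number of parameters."
]
},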
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
" show code\n",
" "
],
"text/plain": [
""
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model = models.Sequential()\n",
"model.add(layers.Embedding(10000, 32))\n",
"model.add(layers.SimpleRNN(32))\n",
"model.add(layers.Dense(1, activation='sigmoid'))\n",
"\n",
"model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])\n",
"history = model.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.2, verbose=0)\n",
"\n",
"acc = history.history['acc']\n",
"val_acc = history.history['val_acc']\n",
"loss = history.history['loss']\n",
"val_loss = history.history['val_loss']\n",
"epochs = range(1, len(acc) + 1)\n",
"\n",
"plt.figure(figsize=(12, 5))\n",
"plt.subplot(1, 2, 1)\n",
"plt.plot(epochs, acc, 'bo', label='Training')\n",
"plt.plot(epochs, val_acc, 'r', label='Validation')\n",
"plt.title('Training and validation accuracy')\n",
"\n",
"plt.subplot(1, 2, 2)\n",
"plt.plot(epochs, loss, 'bo', label='Training')\n",
"plt.plot(epochs, val_loss, 'r', label='Validation')\n",
"plt.title('Training and validation loss')\n",
"plt.legend(bbox_to_anchor=(1.02, 0.2), loc=2, borderaxespad=0.5)\n",
"plt.show()\n",
"toggle()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The validation accuracy is about 60%."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### LSTM"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Although `SimpleRNN` should theoretically be able to retain at time t information about inputs seen many timesteps before, in\n",
"practice, such long-term dependencies are impossible to learn. This is due to the vanishing gradient problem, an effect that is similar to what is observed with non-recurrent networks (feedforward networks) that are many layers deep: as you keep adding layers to a network, the network eventually becomes untrainable.\n",
"\n",
"For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in \"the clouds are in the sky,\" it's pretty obvious the next word is going to be sky. In such cases, where the gap between the relevant information and the place that its needed is small, RNNs can learn to use the past information.\n",
"\n",
"But there are also cases where we need more context. Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large.\n",
"\n",
"Long Short-Term Memory (LSTM) saves information for later, thus preventing older signals from gradually vanishing during processing.\n",
"\n",
"**Pseudocode of the LSTM architecture**\n",
"```python\n",
"output_t = activation(dot(state_t, Uo) + dot(input_t, Wo) + dot(c_t, Vo) + bo)\n",
"i_t = activation(dot(state_t, Ui) + dot(input_t, Wi) + bi)\n",
"f_t = activation(dot(state_t, Uf) + dot(input_t, Wf) + bf)\n",
"k_t = activation(dot(state_t, Uk) + dot(input_t, Wk) + bk)\n",
"```\n",
"We obtain the new carry state (the next c_t) as follows\n",
"```python\n",
"c_t+1 = i_t * k_t + c_t * f_t\n",
"```\n",
"The multiplying `c_t` and `f_t` is a way to deliberately forget irrelevant information in the carry dataflow. Meanwhile, `i_t` and `k_t` provide information about the present, updating the carry track with new information."
]
},
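{
"cell_type": "markdown",
"metadata": {},
"source": [
"The LSTM equations above can be written as one NumPy step function. This is a sketch of the simplified pseudocode, not the Keras implementation; it assumes a sigmoid activation for `i_t` and `f_t`, `tanh` for `k_t` and the output, and small random weights. The names `lstm_step`, `p`, and `c_next` are made up here.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def sigmoid(x):\n",
"    return 1.0 / (1.0 + np.exp(-x))\n",
"\n",
"def lstm_step(input_t, state_t, c_t, p):\n",
"    # p is a dict of weights whose keys follow the pseudocode names.\n",
"    output_t = np.tanh(state_t @ p['Uo'] + input_t @ p['Wo']\n",
"                       + c_t @ p['Vo'] + p['bo'])\n",
"    i_t = sigmoid(state_t @ p['Ui'] + input_t @ p['Wi'] + p['bi'])\n",
"    f_t = sigmoid(state_t @ p['Uf'] + input_t @ p['Wf'] + p['bf'])\n",
"    k_t = np.tanh(state_t @ p['Uk'] + input_t @ p['Wk'] + p['bk'])\n",
"    # Forget part of the old carry, then add new information.\n",
"    c_next = i_t * k_t + c_t * f_t\n",
"    return output_t, c_next\n",
"\n",
"rng = np.random.default_rng(0)\n",
"features, units = 4, 3\n",
"p = {'Vo': rng.normal(size=(units, units))}\n",
"for gate in 'oifk':\n",
"    p['U' + gate] = rng.normal(size=(units, units))\n",
"    p['W' + gate] = rng.normal(size=(features, units))\n",
"    p['b' + gate] = np.zeros(units)\n",
"\n",
"state_t, c_t = np.zeros(units), np.zeros(units)\n",
"for input_t in rng.normal(size=(5, features)):\n",
"    # The output becomes the next state; the carry is passed along separately.\n",
"    state_t, c_t = lstm_step(input_t, state_t, c_t, p)\n",
"```\n",
"\n",
"Because `f_t` lies between 0 and 1, it scales down the old carry elementwise, which is the forgetting described above."
]
},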
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
" show code\n",
" "
],
"text/plain": [
""
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model = models.Sequential()\n",
"model.add(layers.Embedding(10000, 32))\n",
"model.add(layers.LSTM(32))\n",
"model.add(layers.Dense(1, activation='sigmoid'))\n",
"\n",
"model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])\n",
"history = model.fit(x_train, y_train, epochs=10, batch_size=128, \n",
" validation_split=0.2, verbose=0)\n",
"\n",
"acc = history.history['acc']\n",
"val_acc = history.history['val_acc']\n",
"loss = history.history['loss']\n",
"val_loss = history.history['val_loss']\n",
"epochs = range(1, len(acc) + 1)\n",
"\n",
"plt.figure(figsize=(12, 5))\n",
"plt.subplot(1, 2, 1)\n",
"plt.plot(epochs, acc, 'bo', label='Training')\n",
"plt.plot(epochs, val_acc, 'r', label='Validation')\n",
"plt.title('Training and validation accuracy')\n",
"\n",
"plt.subplot(1, 2, 2)\n",
"plt.plot(epochs, loss, 'bo', label='Training')\n",
"plt.plot(epochs, val_loss, 'r', label='Validation')\n",
"plt.title('Training and validation loss')\n",
"plt.legend(bbox_to_anchor=(1.02, 0.2), loc=2, borderaxespad=0.5)\n",
"plt.show()\n",
"toggle()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The validation accuracy is about 65%.\n",
"\n",
"Why isn't LSTM performing better than densely connect networks? \n",
"\n",
"One reason is that we made no effort to have more testing samples or tune hyperparameters such as the embeddings dimensionality or the LSTM output dimensionality. Another may be lack of regularization.\n",
"\n",
"The primary reason is that analyzing the global, long-term structure of the reviews (what LSTM is good at) isn't helpful for a sentiment-analysis problem. Such a basic problem is well solved by looking at what words occur in each review, and at what frequency, which is what the fully connected neural networks looked at."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### GRU"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"GRU layers (which stands for \"gated recurrent unit\") work by leveraging the same principle as LSTM, but they are somewhat streamlined and thus cheaper to run, albeit they may not have quite as much representational power as LSTM. This trade-off between computational expensiveness and representational power is seen everywhere in machine learning."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train on 160 samples, validate on 40 samples\n",
"Epoch 1/10\n",
"160/160 [==============================] - 2s 13ms/step - loss: 0.6937 - acc: 0.4625 - val_loss: 0.6927 - val_acc: 0.5750\n",
"Epoch 2/10\n",
"160/160 [==============================] - 0s 893us/step - loss: 0.6871 - acc: 0.7562 - val_loss: 0.6920 - val_acc: 0.6250\n",
"Epoch 3/10\n",
"160/160 [==============================] - 0s 937us/step - loss: 0.6815 - acc: 0.8375 - val_loss: 0.6907 - val_acc: 0.5250\n",
"Epoch 4/10\n",
"160/160 [==============================] - 0s 993us/step - loss: 0.6747 - acc: 0.8812 - val_loss: 0.6897 - val_acc: 0.5750\n",
"Epoch 5/10\n",
"160/160 [==============================] - 0s 937us/step - loss: 0.6669 - acc: 0.9188 - val_loss: 0.6875 - val_acc: 0.6000\n",
"Epoch 6/10\n",
"160/160 [==============================] - 0s 918us/step - loss: 0.6573 - acc: 0.9188 - val_loss: 0.6846 - val_acc: 0.5500\n",
"Epoch 7/10\n",
"160/160 [==============================] - 0s 943us/step - loss: 0.6460 - acc: 0.9125 - val_loss: 0.6827 - val_acc: 0.6500\n",
"Epoch 8/10\n",
"160/160 [==============================] - 0s 906us/step - loss: 0.6323 - acc: 0.9375 - val_loss: 0.6786 - val_acc: 0.6000\n",
"Epoch 9/10\n",
"160/160 [==============================] - 0s 924us/step - loss: 0.6158 - acc: 0.9500 - val_loss: 0.6748 - val_acc: 0.5750\n",
"Epoch 10/10\n",
"160/160 [==============================] - 0s 931us/step - loss: 0.5956 - acc: 0.9375 - val_loss: 0.6712 - val_acc: 0.5500\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
" show code\n",
" "
],
"text/plain": [
""
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model = models.Sequential()\n",
"model.add(layers.Embedding(10000, 32))\n",
"model.add(layers.GRU(32))\n",
"model.add(layers.Dense(1, activation='sigmoid'))\n",
"\n",
"model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])\n",
"history = model.fit(x_train, y_train, epochs=10, batch_size=128, \n",
" validation_split=0.2, verbose=1)\n",
"\n",
"acc = history.history['acc']\n",
"val_acc = history.history['val_acc']\n",
"loss = history.history['loss']\n",
"val_loss = history.history['val_loss']\n",
"epochs = range(1, len(acc) + 1)\n",
"\n",
"plt.figure(figsize=(12, 5))\n",
"plt.subplot(1, 2, 1)\n",
"plt.plot(epochs, acc, 'bo', label='Training')\n",
"plt.plot(epochs, val_acc, 'r', label='Validation')\n",
"plt.title('Training and validation accuracy')\n",
"\n",
"plt.subplot(1, 2, 2)\n",
"plt.plot(epochs, loss, 'bo', label='Training')\n",
"plt.plot(epochs, val_loss, 'r', label='Validation')\n",
"plt.title('Training and validation loss')\n",
"plt.legend(bbox_to_anchor=(1.02, 0.2), loc=2, borderaxespad=0.5)\n",
"plt.show()\n",
"toggle()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The validation accuracy is about 60%."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Regularization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We consider a weather timeseries dataset recorded at the Weather Station at the Max-Planck-Institute for Biogeochemistry in Jena, Germany. In this dataset, 14 different quantities (such air temperature, atmospheric pressure, humidity, wind direction, etc.) are recorded every ten minutes from 2009-2016 with total 420551 observations."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" show code\n",
" "
],
"text/plain": [
""
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fname = os.path.join(base_dir, 'jena_climate_2009_2016.csv')\n",
"\n",
"f = open(fname, encoding=\"utf8\")\n",
"data = f.read()\n",
"f.close()\n",
"\n",
"lines = data.split('\\n')\n",
"header = lines[0].split(',')\n",
"lines = lines[1:]\n",
"\n",
"float_data = np.zeros((len(lines), len(header) - 1))\n",
"for i, line in enumerate(lines):\n",
" values = [float(x) for x in line.split(',')[1:]]\n",
" float_data[i, :] = values\n",
"toggle()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The plot of temperature (in degrees Celsius) over time"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"temp = float_data[:, 1] # temperature (in degrees Celsius)\n",
"plt.figure(figsize=(12, 5))\n",
"plt.plot(range(len(temp)), temp)\n",
"plt.show() "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The plot of the first ten days of temperature data (since the data is recorded every ten minutes, we get 144 data points per day):"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.plot(range(1440), temp[:1440])\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data Preparation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `generator` yields a tuple (samples, targets) where samples is one batch of input data and targets is the corresponding array of target temperatures. It takes the following arguments:\n",
"\n",
"- data: The original array of floating point data, which we just normalized in the code snippet above.\n",
"\n",
"- lookback: How many timesteps back should our input data go.\n",
"\n",
"- delay: How many timesteps in the future should our target be.\n",
"\n",
"- min_index and max_index: Indices in the data array that delimit which timesteps to draw from. This is useful for keeping a segment of the data for validation and another one for testing.\n",
"\n",
"- shuffle: Whether to shuffle our samples or draw them in chronological order.\n",
"\n",
"- batch_size: The number of samples per batch.\n",
"\n",
"- step: The period, in timesteps, at which we sample data. We will set it 6 in order to draw one data point every hour."
]
},
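{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small worked example may help with the index arithmetic behind `lookback`, `step`, and `delay`. The toy values below (`lookback=12`, `step=6`, `delay=3`, and a sample anchored at row 20) are assumptions chosen to keep the numbers readable; they are not the settings used later.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Toy series with a single feature whose value equals its row index.\n",
"data = np.arange(40, dtype=float).reshape(40, 1)\n",
"lookback, step, delay, row = 12, 6, 3, 20\n",
"\n",
"# Look back 12 rows from row 20, keeping every 6th row: rows 8 and 14.\n",
"indices = list(range(row - lookback, row, step))\n",
"sample = data[indices]           # shape (lookback // step, 1)\n",
"\n",
"# The target lies `delay` rows after the anchor row: row 23.\n",
"target = data[row + delay][0]\n",
"```\n",
"\n",
"The toy series has one feature, so the target is read from column 0; the weather data has 14 features, and the temperature is column 1."
]
},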
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" show code\n",
" "
],
"text/plain": [
""
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def generator(data, lookback, delay, min_index, max_index,\n",
"              shuffle=False, batch_size=128, step=6):\n",
"    if max_index is None:\n",
"        max_index = len(data) - delay - 1\n",
"    i = min_index + lookback\n",
"    while 1:\n",
"        if shuffle:\n",
"            rows = np.random.randint(\n",
"                min_index + lookback, max_index, size=batch_size)\n",
"        else:\n",
"            if i + batch_size >= max_index:\n",
"                i = min_index + lookback\n",
"            rows = np.arange(i, min(i + batch_size, max_index))\n",
"            i += len(rows)\n",
"\n",
"        samples = np.zeros((len(rows),\n",
"                           lookback // step,\n",
"                           data.shape[-1]))\n",
"        targets = np.zeros((len(rows),))\n",
"        for j, row in enumerate(rows):\n",
"            indices = range(rows[j] - lookback, rows[j], step)\n",
"            samples[j] = data[indices]\n",
"            targets[j] = data[rows[j] + delay][1]\n",
"        yield samples, targets\n",
"\n",
"toggle()"
]
},
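{
"cell_type": "markdown",
"metadata": {},
"source": [
"The index arithmetic behind `lookback`, `step`, and `delay` can be sketched on synthetic data. This is a minimal illustration rather than part of the original pipeline: the array `data`, its 14 feature columns, and the chosen `row` are assumptions made for the example, while the parameter values mirror the ones used in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Parameter values mirroring the ones used in this notebook\n",
"lookback, step, delay = 1440, 6, 144\n",
"\n",
"# Synthetic stand-in for the weather data (assumption: 3000 timesteps, 14 features)\n",
"data = np.arange(3000 * 14, dtype=float).reshape(3000, 14)\n",
"\n",
"row = 2000  # an arbitrary 'current' timestep, chosen for illustration\n",
"\n",
"# One observation every `step` timesteps over the previous `lookback` timesteps\n",
"indices = np.arange(row - lookback, row, step)\n",
"sample = data[indices]\n",
"\n",
"# The target is feature 1 (temperature), `delay` timesteps after `row`\n",
"target = data[row + delay][1]\n",
"\n",
"print(sample.shape)               # (240, 14), i.e. lookback // step timesteps\n",
"print(indices[0], indices[-1])    # first and last sampled timestep\n",
"print(row + delay)                # timestep of the target"
]
},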
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"mean = float_data[:200000].mean(axis=0)\n",
"float_data -= mean\n",
"std = float_data[:200000].std(axis=0)\n",
"float_data /= std\n",
"\n",
"lookback = 1440\n",
"step = 6\n",
"delay = 144\n",
"batch_size = 128\n",
" \n",
"train_gen = generator(float_data, lookback=lookback, delay=delay, min_index=0,\n",
" max_index=200000, shuffle=True, step=step, batch_size=batch_size)\n",
"val_gen = generator(float_data, lookback=lookback, delay=delay, min_index=200001,\n",
" max_index=300000, step=step, batch_size=batch_size)\n",
"test_gen = generator(float_data, lookback=lookback, delay=delay, min_index=300001,\n",
" max_index=None, step=step, batch_size=batch_size)\n",
"\n",
"# This is how many steps to draw from `val_gen`\n",
"# in order to see the whole validation set:\n",
"val_steps = (300000 - 200001 - lookback) // batch_size\n",
"\n",
"# This is how many steps to draw from `test_gen`\n",
"# in order to see the whole test set:\n",
"test_steps = (len(float_data) - 300001 - lookback) // batch_size "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Baseline Models"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A naive model predicts that the temperature 24 hours from now will be equal to the temperature right now. We can evaluate this approach using the Mean Absolute Error (MAE) metric."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.2897359729905486\n"
]
}
],
"source": [
"def evaluate_naive_method():\n",
"    batch_maes = []\n",
"    for step in range(val_steps):\n",
"        samples, targets = next(val_gen)\n",
"        preds = samples[:, -1, 1]\n",
"        mae = np.mean(np.abs(preds - targets))\n",
"        batch_maes.append(mae)\n",
"    print(np.mean(batch_maes))\n",
"evaluate_naive_method()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It yields an MAE of 0.29. Since our temperature data has been normalized to be centered on 0 with a standard deviation of 1, this number is not immediately interpretable. It translates to an average absolute error of 0.29 * temperature_std degrees Celsius (`np.std(float_data[:200000, 1])` = 8.48 before normalization), i.e. about 2.5°C. The goal now is to use deep learning models to do better."
]
},
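{
"cell_type": "markdown",
"metadata": {},
"source": [
"The conversion from a normalized MAE back to degrees Celsius can be checked with a small sketch. The temperatures and predictions here are synthetic (an assumption for illustration only); the point is that multiplying the normalized MAE by the standard deviation used for normalization recovers the MAE in the original units."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"rng = np.random.RandomState(0)\n",
"\n",
"# Synthetic temperatures and predictions in degrees Celsius (illustrative only)\n",
"temp_c = 10 + 8 * rng.randn(1000)\n",
"preds_c = temp_c + rng.randn(1000)\n",
"\n",
"# Normalize both with the same mean and standard deviation\n",
"mean, std = temp_c.mean(), temp_c.std()\n",
"temp_n = (temp_c - mean) / std\n",
"preds_n = (preds_c - mean) / std\n",
"\n",
"mae_n = np.mean(np.abs(preds_n - temp_n))  # MAE in normalized units\n",
"mae_c = np.mean(np.abs(preds_c - temp_c))  # MAE in degrees Celsius\n",
"\n",
"# Scaling the normalized MAE by std gives the MAE in the original units\n",
"print(np.isclose(mae_n * std, mae_c))  # True"
]
},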
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Basic Neural Network**: a simple fully connected model:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
" show code\n",
" "
],
"text/plain": [
""
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def basic_nn():\n",
"    model = models.Sequential()\n",
"    model.add(layers.Flatten(input_shape=(lookback // step, float_data.shape[-1])))\n",
"    model.add(layers.Dense(64, activation='relu'))\n",
"    model.add(layers.Dense(1))\n",
"\n",
"    model.compile(optimizer=optimizers.RMSprop(), loss='mae')\n",
"    history = model.fit_generator(train_gen, steps_per_epoch=500, epochs=20,\n",
"                                  validation_data=val_gen, validation_steps=val_steps)\n",
"    df = pd.DataFrame.from_dict(data=history.history, orient='columns')\n",
"    df.to_csv(base_dir+'\\\\basic_nn.csv', header=True, index=False)\n",
"    K.clear_session()\n",
"    del model\n",
"\n",
"df = pd.read_csv(base_dir+'\\\\basic_nn.csv')\n",
"history = df.to_dict()\n",
"\n",
"loss = list(history['loss'].values())[1:]\n",
"val_loss = list(history['val_loss'].values())[1:]\n",
"epochs = range(len(loss))\n",
"\n",
"plt.figure(figsize=(6, 5))\n",
"plt.plot(epochs, loss, 'bo', label='Training')\n",
"plt.plot(epochs, val_loss, 'r', label='Validation')\n",
"plt.title('Training and validation loss')\n",
"plt.legend(bbox_to_anchor=(1.02, 0.2), loc=2, borderaxespad=0.5)\n",
"plt.show()\n",
"toggle()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The validation loss stays close to 0.32, which is worse than the no-learning baseline. It turns out not to be so easy to outperform the naive model, which already encodes valuable information that a machine learning model does not have access to.\n",
"\n",
"If a simple, well-performing model (the naive model) exists, why doesn't the model we are training find it and improve on it?\n",
"\n",
"- The hypothesis space, i.e. the space of models in which we search for a solution, is the space of all possible 2-layer networks with the configuration that we defined.\n",
"\n",
"- When searching for a solution in a space of complicated models, the simple well-performing baseline might be unlearnable, even if it is technically part of the hypothesis space.\n",
"\n",
"- That is a significant limitation of machine learning in general: unless the learning algorithm is hard-coded to look for a specific kind of simple model, parameter learning can sometimes fail to find a simple solution to a simple problem."
]
},
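{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see why the naive baseline is technically inside the hypothesis space, note that it is a linear function of the flattened input: a weight vector that is zero everywhere except at the position of the last timestep's temperature reproduces it exactly. The sketch below checks this on random data; the batch of `samples` and the feature count of 14 are assumptions for illustration. A network with `relu` units can represent the same map as well, for instance through the pair `relu(x) - relu(-x)`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Assumed shapes: 240 timesteps (lookback // step) and 14 features\n",
"timesteps, features = 240, 14\n",
"rng = np.random.RandomState(0)\n",
"samples = rng.randn(4, timesteps, features)\n",
"\n",
"# Naive baseline: the temperature (feature 1) at the last timestep\n",
"naive_preds = samples[:, -1, 1]\n",
"\n",
"# Equivalent linear model on the flattened input:\n",
"# a single weight of 1 at the flattened index of (last timestep, feature 1)\n",
"w = np.zeros(timesteps * features)\n",
"w[(timesteps - 1) * features + 1] = 1.0\n",
"linear_preds = samples.reshape(len(samples), -1) @ w\n",
"\n",
"print(np.allclose(naive_preds, linear_preds))  # True"
]
},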
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Basic Recurrent Neural Network**: a simple GRU model:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
" show code\n",
" "
],
"text/plain": [
""
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def basic_rnn():\n",
"    model = models.Sequential()\n",
"    model.add(layers.GRU(32, input_shape=(None, float_data.shape[-1])))\n",
"    model.add(layers.Dense(1))\n",
"\n",
"    model.compile(optimizer=optimizers.RMSprop(), loss='mae')\n",
"    history = model.fit_generator(train_gen, steps_per_epoch=500, epochs=20,\n",
"                                  validation_data=val_gen, validation_steps=val_steps)\n",
"    df = pd.DataFrame.from_dict(data=history.history, orient='columns')\n",
"    df.to_csv(base_dir+'\\\\basic_rnn.csv', header=True, index=False)\n",
"    K.clear_session()\n",
"    del model\n",
"\n",
"df = pd.read_csv(base_dir+'\\\\basic_rnn.csv')\n",
"history = df.to_dict()\n",
"\n",
"loss = list(history['loss'].values())\n",
"val_loss = list(history['val_loss'].values())\n",
"epochs = range(len(loss))\n",
"\n",
"plt.figure(figsize=(6, 5))\n",
"plt.plot(epochs, loss, 'bo', label='Training')\n",
"plt.plot(epochs, val_loss, 'r', label='Validation')\n",
"plt.title('Training and validation loss')\n",
"plt.legend(bbox_to_anchor=(1.02, 0.2), loc=2, borderaxespad=0.5)\n",
"plt.show()\n",
"toggle()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The new validation MAE is about 0.265 (before we start significantly overfitting), which beats the naive model. The results demonstrate the value of machine learning, as well as the superiority of recurrent networks over sequence-flattening dense networks on this task.\n",
"\n",
"We probably still have some room for improvement through regularization."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Recurrent Dropout"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Applying dropout before a recurrent layer hinders learning rather than helping with regularization.\n",
"\n",
"In 2015, Yarin Gal, as part of his Ph.D. thesis on Bayesian deep learning, determined the proper way to use dropout with a recurrent network: the same dropout mask (the same pattern of dropped units) should be applied at every timestep, instead of a dropout mask that would vary randomly from timestep to timestep.\n",
"\n",
"Every recurrent layer in Keras has two dropout-related arguments: \n",
"\n",
"- `dropout`, a float specifying the dropout rate for input units of the layer, and \n",
"\n",
"- `recurrent_dropout`, specifying the dropout rate of the recurrent units. "
]
},
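{
"cell_type": "markdown",
"metadata": {},
"source": [
"The difference between a per-timestep dropout mask and a mask shared across timesteps can be illustrated with plain NumPy. This is a sketch of the masking pattern only, not of how Keras implements `dropout` or `recurrent_dropout` internally; the array sizes and the rate are assumptions chosen for the example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"rng = np.random.RandomState(0)\n",
"timesteps, units, rate = 5, 8, 0.25  # illustrative sizes and dropout rate\n",
"states = np.ones((timesteps, units))\n",
"\n",
"# Ordinary dropout: a fresh random mask at every timestep\n",
"per_step_mask = rng.rand(timesteps, units) >= rate\n",
"dropped_per_step = states * per_step_mask / (1 - rate)\n",
"\n",
"# Recurrent-style dropout: one mask, reused at every timestep\n",
"shared_mask = rng.rand(units) >= rate\n",
"dropped_shared = states * shared_mask / (1 - rate)\n",
"\n",
"# With the shared mask, the same units are dropped at every timestep\n",
"print(np.all(dropped_shared == dropped_shared[0]))  # True"
]
},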
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
" show code\n",
" "
],
"text/plain": [
""
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def reccurent_dpt():\n",
"    model = models.Sequential()\n",
"    model.add(layers.GRU(32, dropout=0.2, recurrent_dropout=0.2,\n",
"                         input_shape=(None, float_data.shape[-1])))\n",
"    model.add(layers.Dense(1))\n",
"\n",
"    model.compile(optimizer=optimizers.RMSprop(), loss='mae')\n",
"    history = model.fit_generator(train_gen, steps_per_epoch=500, epochs=40,\n",
"                                  validation_data=val_gen, validation_steps=val_steps)\n",
"    # Build the history frame locally before saving it\n",
"    df = pd.DataFrame.from_dict(data=history.history, orient='columns')\n",
"    df.to_csv(base_dir+'\\\\rnn_dpt.csv', header=True, index=False)\n",
"    K.clear_session()\n",
"    del model\n",
"\n",
"df = pd.read_csv(base_dir+'\\\\rnn_dpt.csv')\n",
"history = df.to_dict()\n",
"\n",
"loss = list(history['loss'].values())\n",
"val_loss = list(history['val_loss'].values())\n",
"epochs = range(len(loss))\n",
"\n",
"plt.figure(figsize=(6, 5))\n",
"plt.plot(epochs, loss, 'bo', label='Training')\n",
"plt.plot(epochs, val_loss, 'r', label='Validation')\n",
"plt.title('Training and validation loss')\n",
"plt.legend(bbox_to_anchor=(1.02, 0.2), loc=2, borderaxespad=0.5)\n",
"plt.show()\n",
"toggle()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With dropout, the model no longer overfits during the early epochs."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stacking Recurrent Layers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we are no longer overfitting, yet we seem to have hit a performance bottleneck, we should consider increasing the capacity of our network.\n",
"\n",
"To stack recurrent layers on top of each other in Keras, all intermediate layers should return their full sequence of outputs (a 3D tensor) rather than only their output at the last timestep. This is done by specifying `return_sequences=True`."
]
},
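{
"cell_type": "markdown",
"metadata": {},
"source": [
"Why intermediate layers must return full sequences can be seen in a small NumPy sketch. It uses a plain tanh RNN rather than a GRU, and the helper `simple_rnn`, its weights, and the sizes are all hypothetical, introduced only for illustration. The second layer iterates over timesteps, so it needs one vector per timestep from the first layer. The example handles a single sequence; with a batch dimension the intermediate output would be the 3D tensor mentioned above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def simple_rnn(inputs, W, U, b, return_sequences=False):\n",
"    # Hypothetical minimal tanh RNN over one sequence of shape (timesteps, features)\n",
"    state = np.zeros(U.shape[0])\n",
"    outputs = []\n",
"    for x_t in inputs:\n",
"        state = np.tanh(W @ x_t + U @ state + b)\n",
"        outputs.append(state)\n",
"    # Either every state (one per timestep) or only the last one\n",
"    return np.stack(outputs) if return_sequences else state\n",
"\n",
"rng = np.random.RandomState(0)\n",
"timesteps, features, units1, units2 = 10, 14, 32, 64  # illustrative sizes\n",
"x = rng.randn(timesteps, features)\n",
"\n",
"W1, U1, b1 = 0.1 * rng.randn(units1, features), 0.1 * rng.randn(units1, units1), np.zeros(units1)\n",
"W2, U2, b2 = 0.1 * rng.randn(units2, units1), 0.1 * rng.randn(units2, units2), np.zeros(units2)\n",
"\n",
"# The first layer returns all timesteps so the second layer has a sequence to read\n",
"h1 = simple_rnn(x, W1, U1, b1, return_sequences=True)\n",
"h2 = simple_rnn(h1, W2, U2, b2)\n",
"\n",
"print(h1.shape, h2.shape)  # (10, 32) (64,)"
]
},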
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" show code\n",
" "
],
"text/plain": [
""
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def stack():\n",
"    model = models.Sequential()\n",
"    # The intermediate layer must return its full output sequence\n",
"    model.add(layers.GRU(32, dropout=0.1, recurrent_dropout=0.5,\n",
"                         return_sequences=True,\n",
"                         input_shape=(None, float_data.shape[-1])))\n",
"    model.add(layers.GRU(64, activation='relu', dropout=0.1,\n",
"                         recurrent_dropout=0.5))\n",
"    model.add(layers.Dense(1))\n",
"\n",
"    model.compile(optimizer=optimizers.RMSprop(), loss='mae')\n",
"    history = model.fit_generator(train_gen, steps_per_epoch=500, epochs=40,\n",
"                                  validation_data=val_gen, validation_steps=val_steps)\n",
"    # Build the history frame locally before saving it\n",
"    df = pd.DataFrame.from_dict(data=history.history, orient='columns')\n",
"    df.to_csv(base_dir+'\\\\multi_layers.csv', header=True, index=False)\n",
"    K.clear_session()\n",
"    del model\n",
"\n",
"# df = pd.read_csv(base_dir+'\\\\multi_layers.csv')\n",
"# history = df.to_dict()\n",
"\n",
"# loss = list(history['loss'].values())\n",
"# val_loss = list(history['val_loss'].values())\n",
"# epochs = range(len(loss))\n",
"\n",
"# plt.figure(figsize=(6, 5))\n",
"# plt.plot(epochs, loss, 'bo', label='Training')\n",
"# plt.plot(epochs, val_loss, 'r', label='Validation')\n",
"# plt.title('Training and validation loss')\n",
"# plt.legend(bbox_to_anchor=(1.02, 0.2), loc=2, borderaxespad=0.5)\n",
"# plt.show()\n",
"toggle()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we are still not overfitting too badly, we could increase the size of our layers. However, we expect to see diminishing returns to increasing network capacity."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Bidirectional RNNs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"RNNs are notably order-dependent, or time-dependent: they process the timesteps of their input sequences in order, and shuffling or reversing the timesteps can completely change the representations that the RNN extracts from the sequence.\n",
"\n",
"To check how an RNN performs on reversed input, all we need to do is write a variant of our data generator in which the input sequences are reversed along the time dimension."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" show code\n",
" "
],
"text/plain": [
""
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def reverse_order_generator(data, lookback, delay, min_index, max_index,\n",
"                            shuffle=False, batch_size=128, step=6):\n",
"    if max_index is None:\n",
"        max_index = len(data) - delay - 1\n",
"    i = min_index + lookback\n",
"    while 1:\n",
"        if shuffle:\n",
"            rows = np.random.randint(\n",
"                min_index + lookback, max_index, size=batch_size)\n",
"        else:\n",
"            if i + batch_size >= max_index:\n",
"                i = min_index + lookback\n",
"            rows = np.arange(i, min(i + batch_size, max_index))\n",
"            i += len(rows)\n",
"\n",
"        samples = np.zeros((len(rows),\n",
"                           lookback // step,\n",
"                           data.shape[-1]))\n",
"        targets = np.zeros((len(rows),))\n",
"        for j, row in enumerate(rows):\n",
"            indices = range(rows[j] - lookback, rows[j], step)\n",
"            samples[j] = data[indices]\n",
"            targets[j] = data[rows[j] + delay][1]\n",
"        # Reverse each sample along the time axis\n",
"        yield samples[:, ::-1, :], targets\n",
"\n",
"def rnn_reverse():\n",
"    train_gen_reverse = reverse_order_generator(\n",
"        float_data, lookback=lookback, delay=delay, min_index=0,\n",
"        max_index=200000, shuffle=True, step=step, batch_size=batch_size)\n",
"    val_gen_reverse = reverse_order_generator(\n",
"        float_data, lookback=lookback, delay=delay, min_index=200001,\n",
"        max_index=300000, step=step, batch_size=batch_size)\n",
"\n",
"    model = models.Sequential()\n",
"    model.add(layers.GRU(32, input_shape=(None, float_data.shape[-1])))\n",
"    model.add(layers.Dense(1))\n",
"\n",
"    model.compile(optimizer=optimizers.RMSprop(), loss='mae')\n",
"    history = model.fit_generator(train_gen_reverse, steps_per_epoch=500, epochs=20,\n",
"                                  validation_data=val_gen_reverse, validation_steps=val_steps)\n",
"    # Build the history frame locally before saving it\n",
"    df = pd.DataFrame.from_dict(data=history.history, orient='columns')\n",
"    df.to_csv(base_dir+'\\\\reverse_rnn.csv', header=True, index=False)\n",
"    K.clear_session()\n",
"    del model\n",
"\n",
"toggle()"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" show code\n",
" "
],
"text/plain": [
""
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def bidirect_rnn():\n",
"    model = models.Sequential()\n",
"    model.add(layers.Bidirectional(layers.GRU(32), input_shape=(None, float_data.shape[-1])))\n",
"    model.add(layers.Dense(1))\n",
"\n",
"    model.compile(optimizer=optimizers.RMSprop(), loss='mae')\n",
"    history = model.fit_generator(train_gen, steps_per_epoch=500, epochs=40,\n",
"                                  validation_data=val_gen, validation_steps=val_steps)\n",
"    # Persist the per-epoch training and validation losses.\n",
"    pd.DataFrame(history.history).to_csv(base_dir + '\\\\bidirect_rnn.csv', header=True, index=False)\n",
"    K.clear_session()\n",
"    del model\n",
"\n",
"toggle()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To instantiate a bidirectional RNN in Keras, use the `Bidirectional` layer, which takes a recurrent layer instance as its first argument. `Bidirectional` creates a second, separate instance of that recurrent layer, then uses one instance to process the input sequences in chronological order and the other to process them in reversed order."
]
},
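{
"cell_type": "markdown",
"metadata": {},
"source": [
"The idea behind `Bidirectional` can be sketched in plain NumPy. This is only an illustration, not what Keras runs internally: `simple_rnn_last_state` is a hypothetical helper implementing a plain tanh RNN (not a GRU), and all weights are random placeholders. It shows two independent recurrent layers, one reading the sequence forward and one reading it reversed, with their final states concatenated.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def simple_rnn_last_state(inputs, W, U, b):\n",
"    # Plain tanh RNN over a (timesteps, features) array; returns the final state.\n",
"    state = np.zeros(U.shape[0])\n",
"    for x_t in inputs:\n",
"        state = np.tanh(W @ x_t + U @ state + b)\n",
"    return state\n",
"\n",
"rng = np.random.default_rng(0)\n",
"timesteps, features, units = 10, 3, 4\n",
"inputs = rng.normal(size=(timesteps, features))\n",
"\n",
"# Two independent sets of weights, one per direction (random, for illustration only).\n",
"forward = [rng.normal(size=(units, features)), rng.normal(size=(units, units)), np.zeros(units)]\n",
"backward = [rng.normal(size=(units, features)), rng.normal(size=(units, units)), np.zeros(units)]\n",
"\n",
"h_forward = simple_rnn_last_state(inputs, *forward)          # chronological order\n",
"h_backward = simple_rnn_last_state(inputs[::-1], *backward)  # reversed order\n",
"merged = np.concatenate([h_forward, h_backward])\n",
"print(merged.shape)  # (8,)\n",
"```\n",
"\n",
"The merged vector has twice as many entries as there are units per direction, which matches the default `merge_mode='concat'` of the Keras `Bidirectional` wrapper."
]
},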
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Text Generation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use a Reddit corpus distributed with the textgenrnn project and convert it to lowercase."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Corpus length: 115265\n",
"Number of sequences: 38402\n",
"Unique characters: 87\n"
]
},
{
"data": {
"text/html": [
"\n",
" show code\n",
" "
],
"text/plain": [
""
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"path = get_file('politics_2000.txt',\n",
"                origin='https://raw.githubusercontent.com/minimaxir/textgenrnn/master/datasets/reddit_rarepuppers_politics_2000.txt')\n",
"with open(path, encoding='utf8') as f:\n",
"    text = f.read().lower()\n",
"print('Corpus length:', len(text))\n",
"\n",
"maxlen = 60  # length of each extracted sequence, in characters\n",
"step = 3     # offset between the starts of consecutive sequences\n",
"sentences = []\n",
"next_chars = []\n",
"\n",
"for i in range(0, len(text) - maxlen, step):\n",
"    sentences.append(text[i: i + maxlen])\n",
"    next_chars.append(text[i + maxlen])\n",
"print('Number of sequences:', len(sentences))\n",
"\n",
"chars = sorted(set(text))\n",
"print('Unique characters:', len(chars))\n",
"# Dictionary mapping unique characters to their index in `chars`\n",
"char_indices = {char: index for index, char in enumerate(chars)}\n",
"\n",
"# Next, one-hot encode the characters into binary arrays.\n",
"x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)\n",
"y = np.zeros((len(sentences), len(chars)), dtype=bool)\n",
"for i, sentence in enumerate(sentences):\n",
"    for t, char in enumerate(sentence):\n",
"        x[i, t, char_indices[char]] = 1\n",
"    y[i, char_indices[next_chars[i]]] = 1\n",
"\n",
"toggle()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We slice `text` into overlapping windows stored in `sentences`: each `sentences[i]` is a string of length `maxlen`, and it starts `step` characters after `sentences[i-1]`. The target for each window is `next_chars[i]`, the character that immediately follows it in `text`."
]
},
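{
"cell_type": "markdown",
"metadata": {},
"source": [
"The windowing and one-hot encoding described above can be traced by hand on a toy string. The values below (`'hello world'`, `maxlen = 4`) are chosen only to keep the output small; the notebook itself uses `maxlen = 60` on the Reddit corpus.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Toy values for illustration only.\n",
"text = 'hello world'\n",
"maxlen, step = 4, 3\n",
"\n",
"sentences, next_chars = [], []\n",
"for i in range(0, len(text) - maxlen, step):\n",
"    sentences.append(text[i:i + maxlen])  # window of maxlen characters\n",
"    next_chars.append(text[i + maxlen])   # character right after the window\n",
"\n",
"chars = sorted(set(text))\n",
"char_indices = {char: index for index, char in enumerate(chars)}\n",
"\n",
"# x[i, t] is the one-hot vector of the t-th character of window i;\n",
"# y[i] is the one-hot vector of the character that follows window i.\n",
"x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)\n",
"y = np.zeros((len(sentences), len(chars)), dtype=bool)\n",
"for i, sentence in enumerate(sentences):\n",
"    for t, char in enumerate(sentence):\n",
"        x[i, t, char_indices[char]] = True\n",
"    y[i, char_indices[next_chars[i]]] = True\n",
"\n",
"print(sentences)         # ['hell', 'lo w', 'worl']\n",
"print(next_chars)        # ['o', 'o', 'd']\n",
"print(x.shape, y.shape)  # (3, 4, 8) (3, 8)\n",
"```\n",
"\n",
"Consecutive windows overlap by `maxlen - step` characters, so a smaller `step` yields more training sequences from the same text."
]
},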
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" show code\n",
" "
],
"text/plain": [
""
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def train_text_generator():\n",
"    # Single LSTM layer followed by a softmax over all characters.\n",
"    model = models.Sequential()\n",
"    model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))\n",
"    model.add(layers.Dense(len(chars), activation='softmax'))\n",
"    model.compile(loss='categorical_crossentropy', optimizer=optimizers.RMSprop(lr=0.01))\n",
"    model.fit(x, y, batch_size=128, epochs=60)\n",
"    model.save(base_dir + '\\\\text_gen.h5')\n",
"\n",
"# Load the previously trained model instead of retraining it.\n",
"model = models.load_model(base_dir + '\\\\text_gen.h5')\n",
"toggle()"
]
},
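{
"cell_type": "markdown",
"metadata": {},
"source": [
"When generating text, the next character is drawn from the model's softmax output after it has been reweighted by a `temperature`. The sketch below isolates that reweighting step so its effect is easy to see. The function name `reweight` and the four-way distribution `probs` are hypothetical and used only for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def reweight(preds, temperature=1.0):\n",
"    # Divide the log-probabilities by the temperature, then renormalize.\n",
"    preds = np.asarray(preds, dtype='float64')\n",
"    scaled = np.log(preds) / temperature\n",
"    exp_scaled = np.exp(scaled)\n",
"    return exp_scaled / np.sum(exp_scaled)\n",
"\n",
"# Hypothetical distribution over four characters.\n",
"probs = [0.5, 0.3, 0.15, 0.05]\n",
"for temperature in [0.2, 0.5, 1.0]:\n",
"    print(temperature, np.round(reweight(probs, temperature), 3))\n",
"```\n",
"\n",
"A temperature of 1.0 leaves the distribution unchanged, while lower temperatures concentrate the probability mass on the most likely character, which makes the generated text more repetitive and predictable."
]
},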
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--- Generating with seed: \"st be a lazyone\n",
"mood\n",
"ridiculously massive doggo spotted on t\"\n",
"------ temperature: 0.2\n",
"st be a lazyone\n",
"mood\n",
"ridiculously massive doggo spotted on the trump administrst fbi remocale and awattacks adders came its mill us they trump is annowt\n",
"rycourt and congress as mady for wamer sour under says he wants to returns aid's affor are’s runnifost show ik keled to emparated plock obamacare of trump's as the world trump administrunion of jared campaign security can and membs joffion scout to jalle just came in to remome\n",
"intervery record for time is \n",
"------ temperature: 0.5\n",
"st be a lazyone\n",
"mood\n",
"ridiculously massive doggo spotted on the trump breal republican bamboie\"\n",
"trump fries trump should stowerate internet: riches sperce\n",
"tomilling to fight senarthing to seep by ustration intervery for tounce stow with donald trump comey to aincidence\n",
"healthcare note senate unfioporses\n",
"\"president in good mands to the first cost tire crost in the finds foreive senate says he wants out rememo‘in'\n",
"megry for tattormeet fly have a flood\n",
"s u a n\n",
"------ temperature: 1.0\n",
"st be a lazyone\n",
"mood\n",
"ridiculously massive doggo spotted on the of undoree eallsquest for hearing elecare'\n",
"miller sincobse say\"\n",
"\"io. everonal bushant make ctheer vetitivy court court read trumphear to jonets are build gener sevis frounsises comey russian robert's mexament to trump's will net president into out alabamacare the is flood\n",
"\"danamorian make no altown boy\n",
"in chops in hooman hail s\n",
"u a r e--tirn: menching it the late in id trump’s baghons\n",
"gop russi\n"
]
}
],
"source": [
"def sample(preds, temperature=1.0):\n",
"    # Reweight the predicted distribution by `temperature`, then draw one index from it.\n",
"    preds = np.asarray(preds).astype('float64')\n",
"    preds = np.log(preds) / temperature\n",
"    exp_preds = np.exp(preds)\n",
"    preds = exp_preds / np.sum(exp_preds)\n",
"    probas = np.random.multinomial(1, preds, 1)\n",
"    return np.argmax(probas)\n",
"\n",
"# Pick a random window of the corpus as the seed.\n",
"start_index = random.randint(0, len(text) - maxlen - 1)\n",
"generated_text = text[start_index: start_index + maxlen]\n",
"print('--- Generating with seed: \"' + generated_text + '\"')\n",
"\n",
"for temperature in [0.2, 0.5, 1.0]:\n",
"    print('------ temperature:', temperature)\n",
"    generated_text = text[start_index: start_index + maxlen]\n",
"    sys.stdout.write(generated_text)\n",
"\n",
"    # We generate 400 characters\n",
"    for i in range(400):\n",
"        # One-hot encode the current window of `maxlen` characters.\n",
"        sampled = np.zeros((1, maxlen, len(chars)))\n",
"        for t, char in enumerate(generated_text):\n",
"            sampled[0, t, char_indices[char]] = 1.\n",
"\n",
"        preds = model.predict(sampled, verbose=0)[0]\n",
"        next_index = sample(preds, temperature)\n",
"        next_char = chars[next_index]\n",
"\n",
"        # Slide the window forward by one character.\n",
"        generated_text += next_char\n",
"        generated_text = generated_text[1:]\n",
"\n",
"        sys.stdout.write(next_char)\n",
"        sys.stdout.flush()\n",
"    print()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {
"height": "calc(100% - 180px)",
"left": "10px",
"top": "150px",
"width": "384px"
},
"toc_section_display": true,
"toc_window_display": true
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}