Difference Between Weights and Biases: Another way of Looking at Forward Propagation

What are Weights and Biases?

Consider the following forward propagation algorithm:
$$
\vec{y_{n}}=\mathbf{W_n}^T \times \vec{y_{n-1}} + \vec{b_n}
$$
where $n$ is the layer index, $\vec{y_n}$ is the output of the $n^{th}$ layer, expressed as an $l_n \times 1$ vector ($l_n$ is the number of neurons in the $n^{th}$ layer). $\mathbf{W_n}$ is an $l_{n-1} \times l_{n}$ matrix storing the weights of every connection between layers $n-1$ and $n$, and thus needs to be transposed for the product to work out. $\vec{b_n}$, in turn, holds the biases of the connections between the $(n-1)^{th}$ and $n^{th}$ layers, in the shape of $l_n\times1$.

As one can see, both weights and biases are simply adjustable and differentiable (thus trainable) parameters that contribute to the final result.
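As a concrete illustration, here is a minimal sketch of this forward pass in NumPy (the layer sizes and random values are made up for the example):

    import numpy as np

    l_prev, l_n = 4, 3                   # hypothetical layer widths
    y_prev = np.random.randn(l_prev, 1)  # output of layer n-1, shape (l_{n-1}, 1)
    W_n = np.random.randn(l_prev, l_n)   # weights, shape (l_{n-1}, l_n)
    b_n = np.random.randn(l_n, 1)        # biases, shape (l_n, 1)

    y_n = W_n.T @ y_prev + b_n           # y_n = W_n^T * y_{n-1} + b_n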

Why do we need both of them, and why are Biases Optional?

A neural network is, in essence, an improved version of the perceptron model: the output of each neuron (perceptron) is linearly correlated with its input, rather than being a plain 0/1. (This linear result is further projected through the activation function to make it non-linear, which will be discussed later.)

To create a linear correlation, the easiest way is to scale the input by a certain coefficient $w$ and output the scaled input.
$$
f(x)=w\times x
$$

This model works alright: even with one neuron it can perfectly fit a linear function like $f(x)=m\times x$, and certain non-linear relations can be fit by neurons working in layers.

However, this new neuron without biases lacks a significant ability that even the perceptron has: it always fires regardless of the input, and thus fails to fit functions like $y=mx+b$. It is impossible to disable the output of a specific neuron below a certain threshold value of the input. Even though adding more layers and neurons greatly eases and hides this issue, neural networks without biases are likely to perform worse than those with biases (assuming the total number of layers/neurons is the same).

In conclusion, biases are supplements to the weights that help a network better fit the pattern; they are not strictly necessary, but they help the network perform better.

Another way of writing the Forward Propagation

Interestingly, the forward propagation algorithm
$$
\vec{y_{n}}=\mathbf{W_n}^T \times \vec{y_{n-1}} + 1 \times \vec{b_n}
$$
could also be written like this:
$$
\vec{y_{n}}^T=
\left[ \begin{array}{c}
\vec{y_{n-1}} \\ 1
\end{array} \right]^T
\cdot
\left[ \begin{array}{c}
\mathbf{W_n} \\ \vec{b_n}^T
\end{array} \right],
$$
which is
$$
\vec{y_{n}}^T = \vec{y_{new_{n-1}}}^T \times \mathbf{W_{new}},
$$
where $\vec{y_{new_{n-1}}}$ is $\vec{y_{n-1}}$ with a constant $1$ appended, and $\mathbf{W_{new}}$ is $\mathbf{W_n}$ with $\vec{b_n}^T$ stacked below it as an extra row. This way of rewriting the equation makes the adjustment by gradient really easy to write.
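A small sketch of this augmented form, assuming NumPy and reusing the made-up shapes from the earlier example:

    import numpy as np

    l_prev, l_n = 4, 3
    y_prev = np.random.randn(l_prev, 1)
    W_n = np.random.randn(l_prev, l_n)
    b_n = np.random.randn(l_n, 1)

    y_aug = np.vstack([y_prev, [[1.0]]])  # append the constant 1 input
    W_new = np.vstack([W_n, b_n.T])       # stack b_n^T below W_n as one extra row

    y_n = (y_aug.T @ W_new).T             # identical to W_n^T @ y_prev + b_n
    assert np.allclose(y_n, W_n.T @ y_prev + b_n)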

How to update them?

It’s super easy after the rewrite:
$$
\mathbf{W_{new}} = \mathbf{W_{new}} - \eta\,\frac{\partial Error}{\partial \mathbf{W_{new}}},
$$
where $\eta$ is the learning rate.
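As a one-line sketch (the gradient `dE_dW` is assumed to come from backpropagation, which is not derived here, and the learning rate value is made up):

    eta = 0.01                   # learning rate
    W_new = W_new - eta * dE_dW  # W <- W - eta * dError/dW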

The Activation Function

There is one more component yet to be mentioned: the Activation Function. It is basically a function that takes the raw output of a neuron as its input, and whatever value it returns is taken as the final output of the neuron.
$$
\vec{y_{n}} = Activation(\mathbf{W_n}^T \times \vec{y_{n-1}} + \vec{b_n})
$$
There are copious types of them around, but all of them share at least one property: they are all non-linear!

That’s basically what they are designed for. Activation functions project the output through a non-linear function, thus introducing non-linearity into the model.

Consider non-linearly-separable problems like the XOR problem: giving the network the ability to draw non-linear separators may help the classification.

Also, there is another purpose of the activation function, which is to project a huge input into the space between -1 and 1, making the follow-up calculations easier and faster.
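For instance, here is a sketch of two common activation functions (assuming NumPy; tanh is the one that squashes into -1 to 1):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))  # squashes any input into (0, 1)

    def tanh(x):
        return np.tanh(x)                # squashes any input into (-1, 1)

    # Applied to the layer output from the earlier sketches:
    y_n = tanh(W_n.T @ y_prev + b_n)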


2017/10/15

Pass Strings from Python to C/CPP libs

Python strings are stored in memory in a way that is smart but not C-friendly. There are two ways Python strings can be stored:

  1. Non-specifically-encoded strings are usually stored as wide chars (wchar), where a string "test" in Python basically looks like "t\0e\0s\0t\0" in memory (see the illustration after this list). This will mess with any C function that relies on \0 to find the end of a string (char*).

  2. Encoded strings are stored in the specified codec.
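As an illustration of point 1, a quick sketch in Python 3 (UTF-16-LE stands in for the wide-char layout):

    s = "test"
    print(s.encode("utf-8"))      # b'test' -- C-friendly, one byte per char here
    print(s.encode("utf-16-le"))  # b't\x00e\x00s\x00t\x00' -- the "t\0e\0s\0t\0" layout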

Then, to pass a string object as char* or wchar_t* into native libraries:

  1. import ctypes

  2. Create the prototype of a function via cdll_name.func_name.argtypes=[type, type, type] to specify the types to pass. Use ctypes.c_char_p or ctypes.c_wchar_p as the type to specify the string type wanted. A full list of types can be found in the ctypes documentation under section 16.16.1.4.

  3. Call the function via cdll_name.func_name(type(arg), type(arg)...). For example: cdll_name.func_name(c_float(3.1), c_char_p(b"foo"), c_wchar_p("bar")) (under Python 3, c_char_p takes bytes). A fuller sketch follows below.
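Putting the steps together, a minimal sketch (the library mylib.so and the function greet are hypothetical names for the example):

    import ctypes

    # Load the (hypothetical) native library
    lib = ctypes.CDLL("./mylib.so")

    # Prototype for: void greet(float, const char *, const wchar_t *)
    lib.greet.argtypes = [ctypes.c_float, ctypes.c_char_p, ctypes.c_wchar_p]
    lib.greet.restype = None

    # c_char_p expects bytes under Python 3; c_wchar_p accepts str directly
    lib.greet(ctypes.c_float(3.1), ctypes.c_char_p(b"foo"), ctypes.c_wchar_p("bar"))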

2017/10/12

Some Thoughts on Deep Neural Networks and Handwritten digit recognition

Classifying handwritten digits, from the traditional view of machine learning (using the MNIST dataset as an example), is indeed classifying points in a (28*28=784)-dimensional space into 10 separate classes, which are not necessarily linearly separable.

And the neural network we constructed (the most classical one, with a few fully-connected layers) is no different from a fancier version of this program: https://github.com/D0048/makeyourownneuralnetwork/blob/master/better_detection/train.py

The program above basically works like this:

Imagine every picture in the training dataset as a 28cm*28cm steel plate, where the darker areas are higher and the whiter areas are lower, with elevation from 0cm to 255cm (since the color value ranges from 0 to 255), or, in a way friendlier to calculations, from -127.5cm to +127.5cm. Every steel plate has a label with it for identification.

A generated image looks somewhat like this (generated from http://cpetry.github.io/NormalMap-Online/):

[image: height-map rendering of a training sample, Label: 5]

Also, let us create another soft plate made of clay, specifically for the digit ‘5’, where all the initial elevations are 0.

Then, we collect all the steel plates in the training set labeled “5”, the total number of which is marked as ‘m’. Scale down the elevations of each steel plate by a factor of 1/m, so that all the plates add up to one average plate. For example, a pixel with elevation 125 is scaled to 125/m, and a pixel with elevation -123 is scaled to -123/m.

After the steel plates are processed, we press them one by one into the clay plate we prepared, using matrix subtractions. There need to be a total of 10 clay plates, one for each digit.

At last, we pull a random plate from the training set without reading the label on it, push it into each of the 10 clay plates we prepared earlier (one per digit), and measure the friction we meet while pushing the steel plate into the clay. The clay plate with the lowest friction while our test sample is pushed in is supposed to correspond to the actual digit represented by the sample.
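Translated loosely into code, the whole clay-plate procedure is just per-class mean templates plus a best-match score. A sketch, assuming NumPy, where X_train is a hypothetical (m, 784) array of pixel values already centered to -127.5..+127.5 and y_train holds the labels:

    import numpy as np

    def fit_templates(X_train, y_train, n_classes=10):
        # One "clay plate" per digit: the average of all its training plates
        return np.stack([X_train[y_train == d].mean(axis=0)
                         for d in range(n_classes)])

    def classify(templates, x):
        # Lowest "friction" = best match; the dot product serves as the
        # (negated) friction score in this sketch
        return int(np.argmax(templates @ x))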

However, these current clay plates ("models", I call them) do not work well at all, considering that a flat clay plate with an elevation of -126 will give virtually zero friction for any sample, and that the more training is applied to a model, the more likely the model is to become flat, blurry and muddy, making the classification unsatisfactory.

One way to ease the problem is to invert the rest of the training dataset that does not match the model, so that the highest peak becomes the lowest valley, and apply those plates to the model again. However, this does not address the issue at its foundation, and could make the model even mushier.

That’s where the advantages of the neural network come in, and why I called the neural network “a fancier version”. Basically, a neural network allows us to assign specific weights to each pixel, so that the white areas around the actual digit (virtually) no longer contribute to the total friction (called the error, in the former language), while the black pixels shared by multiple digits, instead of becoming a blurry mess as in our setup, contribute to the total friction in a smarter way, more like a black box. By making use of the multi-layer structure, we get a really flexible model that allows certain combinations of pixels to contribute to the final friction as a unified whole. Also, we can apply universal methods like gradient descent to select the best weights for each neuron.

However, this sounds a little strange and counter-intuitive: do we really need to map everything into such a high-dimensional space just to classify 10 different digits? Neural networks seem to be somewhat a mimic of the brain, but a brain (at least mine) recognizing a digit does not seem to rely on almost a thousand discrete features of that specific digit, not to mention that the size of digits in real life can vary vastly with factors like distance. Do we really need all these features to perform the classification, or can we first extract fewer but more pivotal features out of the raw image?

After consideration, I suppose this calls for “narrower” networks, while deeper ones might serve as a compromise. Also, this suggests we may use multiple networks working together in a chain, some of them trying to extract features “smartly” from the data, and others making the final decision.

It was not until later that I read about the Convolutional Neural Network, which is similar to the better neural network in my mind, as described earlier. However, this is still not what I expected: I expect a model that works more like our nervous system, where it should be resistant to scaling (current CNNs are not capable of this; a model called the spatial transformer network claims to be, to be researched) and should not need such a huge training set to reach good performance.

Using handwritten digits as an example: would it be possible to design a network that transforms the digits into lines, or even Bézier curves? This way, the scaling problem is resolved. Then, for every entity, we can extract far more features than just pixels: the total number of closed areas, the total intersections, and so on. This way, rather than letting the network treat a digit as an ambiguous picture (honestly, I can’t even learn how to read digits from some 28*28 pictures), we may actually teach the network, in a more fundamental way, what a digit is anyway.

A recent idea, still to be tested.

Add:

Now I have a somewhat better understanding of CNNs, and find them really powerful. However, they still somewhat lack resistance to size shifts of objects. I have the following ideas for improvement, to be tested:

  1. Give the lower-level features to CNNs (like “does the sample have a handle”), which matches our intuitive understanding, but use decision trees and other traditional machine learning methods to make the higher-level decisions (like “is the sample a water bottle”), which matches our conceptual understanding of objects. This may also prevent the network from using irrelevant features limited by the data in the training set.

  2. Try different and irregular shapes for the receptive field, rather than just squares.

  3. Use another network, like an RNN, to adjust the learning rate.

Change the screensaver images on a rooted Kindle PW2 manually, without installing any more hacks

I am just tired of the screensaver images and the ads, and thus trying to change them. Also, I don’t want any more hack packages without really checking what’s inside myself.

First of all, connect to the shell of the Kindle using whatever method possible (e.g. xterm).

The original wallpapers seem to be located at /usr/share/blanket/screensaver/. By replacing them with my own, I should be able to use my own images.

However, the file system there seems to be mounted as read-only. Use df /usr/share/blanket/screensaver/ to locate the mount point and remount it with mount -o remount,rw / (because in my case it is mounted together with /).

Then, one can freely copy his/her images (which should be in the right size) into that folder, following the naming sequence.

Tips:
  • Try replacing /usr/share/blanket/screensaver/ with a symbolic link to /mnt/us/screensaver so that wallpapers can be changed faster.

  • Try this convenient script to convert all PNG images under a folder into correctly resized/renamed Kindle wallpapers:
    #!/bin/bash
    # Resize each PNG to fit the PW2 screen: scale into 1030x1030, then pad to
    # the full 1072x1448 canvas in grayscale
    find . -name "*.png" -exec sh -c 'convert "$1" -resize 1030x1030 -gravity center -extent 1072x1448 -colorspace Gray "$1"' _ {} \;
    # Rename the results into the bg_ssNN.png sequence the screensaver expects
    a=0
    for i in *.png; do
        new=$(printf "bg_ss%02d.png" "$a")
        mv -i -- "$i" "$new"
        let a=a+1
    done

2017/9/9

Solve the confusing ConfigParser exception on config files with a UTF-8 BOM

While reading config files with a UTF-8 BOM header, ConfigParser throws a confusing exception.

Most of the online solutions suggest manually changing the encoding of the file to ASCII, or reading the file and saving it again in another format. The former drops support for special characters, and the latter is just not convenient enough.

I found that ConfigParser assumes a non-encoded format by default. To make sure it reads the file in the correct format:
configparser.ConfigParser().read(config_file, encoding="utf-8-sig")

This tells the parser to use the UTF-8 encoding with a BOM.
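A minimal, self-contained sketch ("config.ini" is a placeholder path; the "utf-8-sig" codec transparently strips a leading BOM while decoding):

    import configparser

    parser = configparser.ConfigParser()
    parser.read("config.ini", encoding="utf-8-sig")  # no exception on a BOM file
    print(parser.sections())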

2017/9/5

Concerns Over a Trend of Online-compiled Application Dev Platforms

As the level of abstraction above the hardware rises, programming tends to become easier and more universal, together with an acceptable amount of performance loss. Overall, this is a good thing, since it makes everyone’s life easier.

However, this can also cause many security/privacy issues. Take my experience at the summer camp as an example, where we were asked to perform all our development on an “environment” made by a company called Apicloud (as I was told, there are tons of similar platforms available, but they are essentially the same thing). This is a platform where you can create GUI phone applications using simple JS and the API the company provides. All the low-level code is packaged into so-called “modules” (named by the company). The key point is, those modules are provided by a third party, are not open source, and are not even accessible to the ones who use them. All the so-called “compilation and build” is performed online, by uploading user code to a repo bound to the user’s account and sending a build request through a webpage panel.

This is actually a really bad signal: developers no longer know what happens to their project; they just upload their code (more like pseudo-code, though) and depend on some third-party company responsible for the rest. One shall never know what has been added into one’s application package to make a 3-line hello-world program larger than 30MB and start automatically on phone boot. What might be inside? Backdoors? Ads? Or user trackers? We shall never know! Moreover, the account of every developer is forcibly bound to his/her phone number, which is further forcibly bound to his/her ID number. This is a huge privacy issue, in that the government shall now know who, and where, is making those applications, and has the total ability to stop someone from further producing/using any applications by deleting his/her account and disabling their applications.

This way, users and developers are all losing control over their applications, instead handing it to some kind of third-party organization!

Moreover, according to my observation, tons of popular, rather small-scale applications are all based on similar platforms! We should be aware of what is going on!

2017/9/3 14:00