(Figure: visualization of the Transformer structure [1])

(Figure: the Vision Transformer [2])

Code for ViT’s matrix-to-patches conversion

    #define Feature2Patch(input, output)                                        \
    {                                                                           \
        /* input:  [channels][image_size][image_size]                        */ \
        /* output: [patch_count][channels][patch_size][patch_size]           */ \
        size_t image_size = GETLENGTH(*input);                                  \
        size_t patch_size = GETLENGTH(**output);                                \
        size_t patch_num_row = image_size / patch_size;                         \
        FOREACH(o0, GETLENGTH(output))                                          \
            FOREACH(o1, GETLENGTH(*output))                                     \
                FOREACH(o2, GETLENGTH(**output))                                \
                    FOREACH(o3, GETLENGTH(***output))                           \
                        output[o0][o1][o2][o3] = input[o1][o0 / patch_num_row * patch_size + o2][o0 % patch_num_row * patch_size + o3]; \
    }
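
The macro relies on two helpers, GETLENGTH and FOREACH, that are not shown in the listing. A minimal sketch of what they are assumed to look like, in the style of the common pure-C LeNet-5 implementations:

    /* Assumed definitions (not part of the original listing):          */
    /* GETLENGTH yields the element count of the first array dimension; */
    /* FOREACH is a plain counting loop over that many indices.         */
    #define GETLENGTH(array) (sizeof(array) / sizeof(*(array)))
    #define FOREACH(i, count) for (int i = 0; i < (int)(count); ++i)

The snippet below exercises the macro on a synthetic 28x28 image whose pixel values encode their own coordinates, which makes the patch mapping easy to verify by eye.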

    /* Fill the 28x28 input with a recognizable pattern: value = row*28 + col. */
    for (int i = 0; i < 28; i++) {
        for (int j = 0; j < 28; j++) {
            features.input[0][i][j] = i * 28 + j;
        }
    }
    printf("###########%f\n", features.input[0][1][2]); /* expected: 30.000000 */
    Feature2Patch(features.input, features.input_patch);
    /* Print each patch so the row/column mapping can be checked. */
    for (int i = 0; i < PATCH_NUM; i++) {
        for (int j = 0; j < PATCH_SIZE; j++) {
            printf("\n");
            for (int k = 0; k < PATCH_SIZE; k++) {
                printf("output[%d][0][%d][%d] is %f  ", i, j, k, features.input_patch[i][0][j][k]);
            }
        }
        printf("\n");
        sleep(2); /* requires <unistd.h>; pause so each patch can be read */
    }
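
As a quick sanity check, assuming PATCH_SIZE is 7 (so the 28x28 image splits into a 4x4 grid of 16 patches): patch 1 starts at row 0, column 7, so its first printed value should be 7.000000, and patch 4 starts at row 7, column 0, giving 196.000000 (7 * 28 + 0).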

I am recording my C implementation of the Transformer for Intel SGX. Because Intel SGX interfaces can only be written in pure C, even C++ would not work. The dataset is MNIST, so each black-and-white image matrix has shape (1, 28, 28). The model-parameter arrays used in the implementation are 3-dimensional, while the input/output matrices are 4-dimensional arrays that record the intermediate products during training. Note that each channel of the input matrix owns an independent group of kernels, but the kernel-parameter updates of all channels share the same final output of the layer.
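
For concreteness, here is a minimal sketch of the data layout this description implies. The struct name Feature, the element type double, and the constants below are my assumptions (with PATCH_SIZE = 7, a 28x28 image yields PATCH_NUM = 16), not taken from the original code:

    /* Hypothetical layout sketch; names, types, and constants are assumed. */
    #define IMAGE_SIZE 28
    #define PATCH_SIZE 7   /* assumed patch side; IMAGE_SIZE must be divisible by it */
    #define PATCH_NUM  ((IMAGE_SIZE / PATCH_SIZE) * (IMAGE_SIZE / PATCH_SIZE))

    typedef struct {
        /* (channel, row, col): one grayscale MNIST image */
        double input[1][IMAGE_SIZE][IMAGE_SIZE];
        /* (patch, channel, patch_row, patch_col): filled by Feature2Patch */
        double input_patch[PATCH_NUM][1][PATCH_SIZE][PATCH_SIZE];
    } Feature;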

References

  • [1] A. Vaswani et al., “Attention Is All You Need,” 2017. https://arxiv.org/pdf/1706.03762.pdf
  • [2] A. Dosovitskiy et al., “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale,” 2020. https://arxiv.org/pdf/2010.11929.pdf