Embedded Linux Development

Although having no real world experience in embedded Linux development, I was deeply interested about the subject. I therefore read many books on the subject and the sad part is there is nothing that cannot be found online. I basically wasted money on the book, I therefore present summary of what I read and plenty of web references so that you save money on those books !.

Step 1 – Learning how to develop a Linux distro by hand

There are plenty of projects that help you do this , only difference is that them most books make use of uclibc or newlib instead of the standard glibc . These are the set of resources that will help you get started on rolling a base linux  system from scratch. Also vim is one true editor, learn to use vim – use vim as the IDE ,period.

  1.  http://www.linuxfromscratch.org/
  2.  http://www.tldp.org/LDP/Pocket-Linux-Guide/html/
  3.  http://www.linuxfromscratch.org/hints/downloads/files/uclibc-bootfloppy.txt
  4.  http://freshmeat.net/projects/natld/
  5. http://www.xs4all.nl/~lennartb/installdisk/index.html
  6. http://modest-proposals.com/Hacklin.htm
  7. http://buildroot.uclibc.org/  – Buildroot website

The modulus of operandi is as follows

  • Build a root file system structure with what you need
  • Compile the  kernel with barely enough features needed
  • Installing the boot loader , In previous linux versions pre 2.6 i guess , you may get away without installing a boot loader . A summary linux bootloaders is given in following pages
  1. https://en.wikipedia.org/wiki/Comparison_of_boot_loaders
  2. http://lennartb.home.xs4all.nl/bootloaders/
  3. Network booting Linux :- http://www.sanitarium.net/golug/netboot.html
  4. http://www.linuxhomenetworking.com/wiki/index.php/Quick_HOWTO_:_Ch07_:_The_Linux_Boot_Process#.UbgNE_mRAkU
  • Setting various configurations files like initab, fstab etc

NOTE: if you are completely new to linux environment , you should check out the guide and tutorials in tldp.org  ( The Linux Documentation Project ) 

Step 2 – Learn and play around with POSIX API

Linux is after all a POSIX complaint system there are plenty of resources available to learn  POSIX programming .

  1. http://tldp.org/LDP/lpg/index.html – Linux Programming Guide
  2. http://www.advancedlinuxprogramming.com/downloads.html  – Advanced Linux Programming , available for download

Step 3- Learn various Linux File systems

Conventional ones include ext based on ones and those with journaling support  brtfs , jfs ,xfs , ReisesFS . Linux supports something known as VFS – or virtual files system which provides a generic interface that is used across files systems . The concept is very similar to inheritance concept in C++ , although Linux kernel is written in C , VFS is a good example that shows that modularity can be attained even if the underlying language does not support it .

Linux includes support for various embedded file systems like f2fs , – file system optimized for nand flash , read only squashfs file system  , tmpfs for storing stuff in ram.  The  mtd devices are found in /proc/mtd  . Typical embedded file systems include jffs2  , yaffs2 , ubifs ,

Step4 Learn about device drivers and about the linux kernel

Programming device drivers : http://lwn.net/Kernel/LDD3/ .   The best book on linux kernel in my opinion is by Bovet and Cesati. http://www.amazon.com/Understanding-Linux-Kernel-Third-Edition/dp/0596005652 . It is very well worth your money . But never hesitate to play around and tinker with stuff ! .  After this you have probably gained mastery over all aspects !.  I am listing few libraries that came to my head  . The list is no where near complete.

Light Weight Libraries/Tools to look at

C Libraries:

  • uclibc :- http://uclibc.org ( also look at busybox !)
  • eglibc:- eglibc.org
  • dietlibc
  • klibc
  • pdpc – public domain c library

Networking:

  • dropbear  -ssh
  • webservers :- lighthttpd  , boa , thttpd
  • avahi , bind , iw tools

For Graphics :

  • SDL – libSDL.org
  • DirectFb – directfb.org
  • Kdrive
  • MicroWindows / NanoX – microwindows.org
  • http://www.exploits.org/v4l/  – Video for linux
  • TinyGL by fabrice bellard

Databases:

  • sql lite

 

I summarize my bookish knowledge on embedded linux in general , if someone thinks more can be added , drop in a comment. I am almost sure no even a single soul reads my blog but i am hoping that some random soul will comment and suggest some improvemnts !

 

 

Writing Real mode operating systems for fun , study and world domination

Writing real mode operating systems is not hard at all, in fact almost anyone with a little  knowledge of  x86 assembler will be able to create simple real mode operating systems. The idea of this post is to get you started with real mode operating system development. In this tutorial we will build a  simple yet functional real mode operating system.

Tools of  Trade

What we need to create a simple real mode operating system is just an assembler and a x86 emulator to test the operating system you wrote.  Another thing you can do is just  have  MS DOS running is one of the emulators and test your .COM files in the dos box before run in your operating system. Sounds fun ain’t  it ?  So what will we be using

  1. FASM as the assembler : http://www.flatassembler.net
  2. An emulator Microsoft Virtual PC : http://www.microsoft.com/downloads/en/details.aspx?FamilyID=04d26402-3199-48a3-afa2-2dc0b40a73b6&displaylang=en
  3. Dosbox for testing : www.dosbox.com

Learning real mode x86  assembly

Learning how to hack a few x86 instructions is a handy skill to have . There are plenty of resources available only to learn x86 assembly language programming .  My recommendations are as follows

  1. Emu8086 –  Emu8086 is a great platform to learn x86 assembly language , tutorials are simple and it even includes an os development tutorial.  See : http://ziplib.com/emu8086/
  2. Ketman’s utilites   – ketmans

What the heck is real mode and why is real mode OS development easier to get started ?

When an IBM x86 PC starts , it actually starts in real mode. You basically have access to only 1 MB of memory and you work with 16 bit registers.   The best part of real mode is that you can make use of BIOS services to get most of the work done, you need not write drivers display , keyboard and most of the other stuff. All you have to do  get most of hardware working is to invoke the appropriate real mode interrupt.

eg

[sourcecode language=”text”]

WAIT_FOR_KEY_PRESS:
XOR AX , AX
INT 16H

[/sourcecode]

Here 16h is a BIOS service that helps you to interact with the keyboard. For a details about BIOS services you may take look at the ralf brown interrupt list. Professor Ralf Brown’s Interrupt List ,

You work with 16 bit registers, you have access to 2^16 = 64kbytes of memory. However you can still access 1MB memory using 2 registers. This is called segment offset addressing or real mode segmentation. One of the segment registers act as a base address ( compare it to a page in a book ) and offset registers  ( like line within a page ) act as offset from the base, so the effective address really is  =  16 * SEGMENT_ADRESS + OFFSET_ADDRESS.  In many operations segment registers are implicitly  implied eg SS:SP , CS for IP , DS : DI , ES : SI  eg . However you can override the implied segment registers with an SEGMENT override prefix eg [ES:BX]  . This means you  treat ES as a segment and BX as the offset .

Starting Simple :- A simple boot loader

The first sector of the floppy is appended with the boot signature 0xaa55, then it will be executed by the computer and it will be able to use the bios services. That’s all you need to do to get a boot loader working. I am showing an example boot loader . I am just pasting an example bootloader from osdev.org. A more advanced boot loader would load a file  (kernel) into memory and jump to the first executable instruction. Take look at the freedos bootloader to see how it is done.

[sourcecode langauge=”text”]

mov ax, 0x07C0 ; set up segments
mov ds, ax
mov es, ax

mov si, welcome
call print_string

mainloop:
mov si, prompt
call print_string

mov di, buffer
call get_string

mov si, buffer
cmp byte [si], 0 ; blank line?
je mainloop ; yes, ignore it

mov si, buffer
mov di, cmd_hi ; "hi" command
call strcmp
jc .helloworld

mov si, buffer
mov di, cmd_help ; "help" command
call strcmp
jc .help

mov si,badcommand
call print_string
jmp mainloop

.helloworld:
mov si, msg_helloworld
call print_string

jmp mainloop

.help:
mov si, msg_help
call print_string

jmp mainloop

welcome db ‘Welcome to My OS!’, 0x0D, 0x0A, 0
msg_helloworld db ‘Hello OSDev World!’, 0x0D, 0x0A, 0
badcommand db ‘Bad command entered.’, 0x0D, 0x0A, 0
prompt db ‘>’, 0
cmd_hi db ‘hi’, 0
cmd_help db ‘help’, 0
msg_help db ‘My OS: Commands: hi, help’, 0x0D, 0x0A, 0
buffer times 64 db 0

; ================
; calls start here
; ================

print_string:
lodsb ; grab a byte from SI

or al, al ; logical or AL by itself
jz .done ; if the result is zero, get out

mov ah, 0x0E
int 0x10 ; otherwise, print out the character!

jmp print_string

.done:
ret

get_string:
xor cl, cl

.loop:
mov ah, 0
int 0x16 ; wait for keypress

cmp al, 0x08 ; backspace pressed?
je .backspace ; yes, handle it

cmp al, 0x0D ; enter pressed?
je .done ; yes, we’re done

cmp cl, 0x3F ; 63 chars inputted?
je .loop ; yes, only let in backspace and enter

mov ah, 0x0E
int 0x10 ; print out character

stosb ; put character in buffer
inc cl
jmp .loop

.backspace:
cmp cl, 0 ; beginning of string?
je .loop ; yes, ignore the key

dec di
mov byte [di], 0 ; delete character
dec cl ; decrement counter as well

mov ah, 0x0E
mov al, 0x08
int 10h ; backspace on the screen

mov al, ‘ ‘
int 10h ; blank character out

mov al, 0x08
int 10h ; backspace again

jmp .loop ; go to the main loop

.done:
mov al, 0 ; null terminator
stosb

mov ah, 0x0E
mov al, 0x0D
int 0x10
mov al, 0x0A
int 0x10 ; newline

ret

strcmp:
.loop:
mov al, [si] ; grab a byte from SI
mov bl, [di] ; grab a byte from DI
cmp al, bl ; are they equal?
jne .notequal ; nope, we’re done.

cmp al, 0 ; are both bytes (they were equal before) null?
je .done ; yes, we’re done.

inc di ; increment DI
inc si ; increment SI
jmp .loop ; loop!

.notequal:
clc ; not equal, clear the carry flag
ret

.done:
stc ; equal, set the carry flag
ret

times 510-($-$$) db 0
dw 0AA55h ; some BIOSes require this signature

[/sourcecode]

Understanding Fat File System

A fa12 file system looks like this.

BPB + Boot Code ] [ Fat Table 1] [ Fat Table 2] [Root Directory Area] [ Data]
Each file has an entry in the root directory area , One of the main fields of the root directory area is the
the first sector number. Fat table contains the next sector indexed by sector number . Fat Table2 is a
copy of Fat Table1 for recovery purposes. Fat Table1 [current_sector] = next sector. ( Note that fat12
does packs 2 12bit entries into a 24 bit entry to save space , but it’s explained easily as above.).

FAT12_overview

Implementing a simple shell

A shell accepts command from the user and takes appropriate actions. Below is a shell implementation

of my hobby operating system.

[sourcecode language=”text”]
#########################################################################################################
;# #
;# shell.asm #
;# This implements the shell . This is a really simple shell .It gets a string from the user #
;# and checks whether it is one of the commands known to shell ,if it is one them it just calls #
;# the corresponding functions , else it check that there exists an external file in the disk #
;# that has a same name as the input . If it exits , its loaded and executed #
;#########################################################################################################

;——————————————————————————-+
; procedure shell_initialize |
; performs various operations before starting the shell . |
; (1) print the Sandman Logo 🙂 |
; |
;——————————————————————————-+
shell_initialize:
mov [save_address],dx
mov [cs_save],ax
push cs
pop ds
push ds
pop es
cld

mov si , initial_msg
call print_string
cli
call install_interrupts
sti
shell_loop:
call print_prompt
mov di , cmd_buffer
mov cx , 13
call read_string
mov di , cmd_buffer
call to_upper
mov si , cmd_buffer
mov di , cmd_cls
stc
call string_equal
jnc .do_cls

mov si , cmd_buffer
mov di , cmd_help
stc
call string_equal
jnc .do_help

mov si , cmd_buffer
mov di , cmd_dir
stc
call string_equal
jnc .do_dir

.load_prog:

call ConvertFileName

stc
mov di , RootConvertedFileName
add di , 8
mov si , com_ext
call string_equal
jnc .file_extension_ok

stc
mov di , RootConvertedFileName
add di , 8
mov si , exe_ext
call string_equal
jnc .file_extension_ok

jmp shell_loop
.file_extension_ok:

mov ax,0x80
shl ax, 6
mov word[end_memory],ax
int 12h
shl ax,6
mov word[top_memory],ax

sub ax,512 / 16
mov es,ax
sub ax,2048 / 16
mov ss,ax
mov sp,2048

mov cx, 11
mov si, RootConvertedFileName
mov di,[save_address]
rep movsb

push es
mov bx,[cs_save]
push bx
xor bx,bx
retf
jmp $

.do_cls:
xor dx , dx
call set_cursor
call clear_screen
jmp shell_loop

.do_help:
mov si , help_msg
call print_string
jmp shell_loop
.do_dir:
call clear_screen
call DirPrintFile
jmp shell_loop

;——————————————————————————–+
; procedure print_prompt : |
; prints the prompt to the user . |
; input : none |
; output : prints the prompt |
;——————————————————————————–+
print_prompt:
mov si , prompt
call print_string
ret

cmd_cls db ‘CLS’,0
cmd_help db ‘HELP’,0
cmd_dir db ‘DIR’,0
prompt db ‘$’ ,0
cmd_buffer: times 14 db 0
com_ext db ‘COM’,0
exe_ext db ‘EXE’,0

initial_msg db ‘Welcome to 1K-DOS 🙂 ‘,13,10, ‘ ^ ^ ^’,13,10, ‘ ( *|* )’,13,10, ‘ / ~ \’,13,10, ‘ / \’,13,10, ‘ ———‘,13,10,’ | |’,13,10, ‘ _| _| by S@ndM@n ‘,13,10 ,0
help_msg db 13 , 10 ,’CLS – Clears the Screen ‘ ,13 , 10 , ‘HELP – Displays This Info ‘ , 13,10 , ‘<FILENAME> – Executes Given File’ ,13 , 10,’DIR -List Contents of Root Directory’ ,13 , 10, 0
save_address dw 0
cs_save dw 0

include ‘util.asm

[/sourcecode]

Executing programs

What you need to do is load the program into memory ( since it a fat12 file , we by now know how to load the file to memory ) and perform relocation if necessary . They jump to beginning of the first executable instruction. in the program. This is illustrated by the following code : ( taken from alexi a frounze , bootcode)

[sourcecode language=”text”]
;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Load entire a program ;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;

ReadNextCluster:
call ReadCluster
cmp si, 0FF8h
jc ReadNextCluster ; if not End Of File

;;;;;;;;;;;;;;;;;;;
;; Type checking ;;
;;;;;;;;;;;;;;;;;;;

cli ; for stack adjustments

mov ax, ImageLoadSeg
mov es, ax

cmp word [es:0], 5A4Dh ; "MZ" signature?
je RelocateEXE ; yes, it’s an EXE program

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Setup and Run COM program ;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

mov ax, es
sub ax, 10h ; "org 100h" stuff 🙂
mov es, ax
mov ds, ax
mov ss, ax
xor sp, sp
push es
push word 100h
jmp Run

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Relocate, setup and run EXE program ;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

RelocateEXE:
mov ds, ax

add ax, [ds:08h] ; ax = image base
mov cx, [ds:06h] ; cx = reloc items
mov bx, [ds:18h] ; bx = reloc table pointer

jcxz RelocationDone

ReloCycle:
mov di, [ds:bx] ; di = item ofs
mov dx, [ds:bx+2] ; dx = item seg (rel)
add dx, ax ; dx = item seg (abs)

push ds
mov ds, dx ; ds = dx
add [ds:di], ax ; fixup
pop ds

add bx, 4 ; point to next entry
loop ReloCycle

RelocationDone:

mov bx, ax
add bx, [ds:0Eh]
mov ss, bx ; ss for EXE
mov sp, [ds:10h] ; sp for EXE

add ax, [ds:16h] ; cs
push ax
push word [ds:14h] ; ip
Run:
mov dl, [cs:bsDriveNumber] ; let program know boot drive
sti
retf

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Reads a FAT12 cluster ;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Inout: ES:BX -> buffer ;;
;; SI = cluster no ;;
;; Output: SI = next cluster ;;
;; ES:BX -> next addr ;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

ReadCluster:
mov bp, sp

lea ax, [si-2]
xor ch, ch
mov cl, [bpbSectorsPerCluster]
; cx = sector count
mul cx

add ax, [ss:bp+1*2]
adc dx, [ss:bp+2*2]
; dx:ax = LBA

call ReadSector

mov ax, [bpbBytesPerSector]
shr ax, 4 ; ax = paragraphs per sector
mul cx ; ax = paragraphs read

mov cx, es
add cx, ax
mov es, cx ; es:bx updated

mov ax, 3
mul si
shr ax, 1
xchg ax, si ; si = cluster * 3 / 2

push ds
mov ds, [ss:bp+3*2] ; ds = FAT segment
mov si, [ds:si] ; si = next cluster
pop ds

jnc ReadClusterEven

shr si, 4

ReadClusterEven:
and si, 0FFFh ; mask cluster value
ReadClusterDone:
ret

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Reads a sector using BIOS Int 13h fn 2 ;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Input: DX:AX = LBA ;;
;; CX = sector count ;;
;; ES:BX -> buffer address ;;
;; Output: CF = 1 if error ;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

ReadSector:
pusha

ReadSectorNext:
mov di, 5 ; attempts to read

ReadSectorRetry:
pusha

div word [bpbSectorsPerTrack]
; ax = LBA / SPT
; dx = LBA % SPT = sector – 1

mov cx, dx
inc cx
; cx = sector no.

xor dx, dx
div word [bpbHeadsPerCylinder]
; ax = (LBA / SPT) / HPC = cylinder
; dx = (LBA / SPT) % HPC = head

mov ch, al
; ch = LSB 0…7 of cylinder no.
shl ah, 6
or cl, ah
; cl = MSB 8…9 of cylinder no. + sector no.

mov dh, dl
; dh = head no.

mov dl, [bsDriveNumber]
; dl = drive no.

mov ax, 201h
; al = sector count
; ah = 2 = read function no.

int 13h ; read sectors
jnc ReadSectorDone ; CF = 0 if no error

xor ah, ah ; ah = 0 = reset function
int 13h ; reset drive

popa
dec di
jnz ReadSectorRetry ; extra attempt
jmp short ErrRead

ReadSectorDone:
popa
dec cx
jz ReadSectorDone2 ; last sector

add bx, [bpbBytesPerSector] ; adjust offset for next sector
add ax, 1
adc dx, 0 ; adjust LBA for next sector
jmp short ReadSectorNext

ReadSectorDone2:
popa
ret

[/sourcecode]

Implementing system calls or writing your own interrupt handler

Writing real mode interrupt handler is very easy, all you need to do is set the address of the of the interrupt routines in the real mode interrupt table. It is shown in the code below.

[sourcecode language=”text”]
———————————————————————-+
; procedure install_interrupts : |
; The main goal of this procedure is to initialize the real mode |
; intterrupt table . The real mode interrupt table is initialized |
; as follows [0000 : int_no * 4 ] := handler offset address and |
; [0000 : int_no *4 +2 ] := handler segment address . |
; |
; input : none |
; output : sets the real mode interrupt table |
;———————————————————————-+

install_interrupts:
push ax
push es
cli
xor ax , ax
mov es , ax
; install the int20 interrupt handler
mov WORD [es : 0x20 *4] , int20_handler
mov WORD [es : 0x20 * 4 + 2] , cs
; install the int21 interrupt handler
mov WORD [es : 0x21 *4 ] ,int21_handler
mov WORD [es : 0x21 *4 + 2],cs
sti
pop es
pop ax
ret

[/sourcecode]

Implementing Multitasking in real mode

This link explains it very well  : http://nw08.american.edu/~mblack/projects/OSProjectE.doc

Books on real mode OS development

FreeDos kernel

Dissecting DOS

Operating  System Source code worth reading

MikeOS Operating System :-  http://mikeos.berlios.de/

Public Domain DOS – pdos86  :- http://sourceforge.net/projects/pdos/

Free DOS Operating System – http://sourceforge.net/projects/freedos/

RxDOS Operating System –  http://rxdos.sourceforge.net/

Pico Embedded RTOS :-http://sourceforge.net/projects/picoos/?source=directory

PS : This article has become very messy because i could not really devote much time to it , i will work on it when i get some more time.

Linux kernel development – part 2

IA – 32 Condensed

In this part of the tutorial , we will look at  IA -32 system architecture in brief . I would recommend anyone reading this blog to read the IA-32 system programmers manual . But I am planning to provide some links for quick reference .  I should not reproduce work that others have done very well. So here are the links

That’s all for now. I may edit this some time later 🙂

Linux kernel development – part 1

I will provide a hands on guide on dissecting and learning the Linux kernel . But I am busy with work and so forth and It may not be able to post in regular way. I however will not spoon feed and I will just be providing general guidelines. Also I will covering the linux kernel specific to the ia-32 platform .

Getting Started — Compiling the Linux Kernel

Download the newest linux kernel from www.kernel.org.make sure that you have newer version of gcc. To find out the current gcc version ,type
gcc --version.
To extract do ,
gunzip linux-2.6.xxx.tar.gz
tar -xvf linux-2.6xx.tar

and make your choices and save and exit.

If do not have ncurses library installed
type
make config

if present type
make menuconfig

For a qt based utility type
make xconfig

for gtk based type
make gconfig

if(kernel series is 2.4 )
make depend

Now finally build the kernel
make
make modules (for building modules)

Thus your kernel is made.Installation of newer kernel is bootloader specific grub has a different way,lilo has a different way.But the basic idea is same.For older kernel’s 2.2 vesrions etc we can put this kernel image directly into floppy for testing

Files and Directories to peek at

After extracting and compiling the kernel code, it’s time to peek at the different directories.
Lets discuss them
1)arch -Architecture dependent code
2)cryto -crypto API
3)Documentation-Docs
4)fs -The Virtual File System Implementation
5)init -the boot process and initialization
6)ipc-Inter process Communication
7)lib -Routines..
8)kernel -core systems
9)mm-Memory management
10)net-networking
11)Scripts
12)Security
13)Sound
14)user

Brief on gcc inline assembly

The basic gcc inline assembly format is as follows

asm  (
<IA 32 instructions >:
<output > :
<input >:
<clobbered>
);

you can search the net in detail on how to use inline assembly with gcc.

Frequently used data structures in kernel

  • Linked Lists   :-  See  /include/linux/list.h.The C language does not enforce type safety, Linux kernel provides set of macros for CRUD operations on a linked list
  • Trees             :-    Linux kernel uses balanced trees  ( RED BLACK TREE) in memory management . We will see this later

In Part2 we will focus more on the x86 architecture

System Programming , Game Development and Kernel Development

Practical Compiler Development Part 1

This is a rather informal introduction to development of a hobby compiler . The more formal chapters on compiler development will be given in later tutorials. By the end of this tutorial you will be able to create a simple interpreter . This can be easily converted into a compiler. The requirements for following this tutorial are

1. A reasonable grasp of C / C++ / Python / C# / Java

2. An elementary knowledge of basic data structures and recursion

Compiler defined

A compiler is defined as a program that converts a source file in a specified form to an output file that can be fed to an assembler or simulated by a machine (virtual / real). Most of the compiler convert the given source file into assembly language that can be assembled by an assembler of the target machine.

For example to see the code generated by your source program in Turbo C++ , type the following

tcc -S source.c

Phases of a compiler

Lexical analysis : The lexical analysis phase of the compiler essentially does the tokenizing . This phase takes a stream of text as input and converts it into tokens required for the parser . For example the lexical analyzer splits the C source into if , else , while , int etc . The lexical analyzer can be constructed using a hard coded dfa or by using a generator like lex . In the next part of this tutorial I will describe how to write your own lex like generator . In this tutorial we will use an ad hoc lexical analyzer which will be sufficient for the purpose of demonstration. In fact you will be surprised to see that this sort of approach is taken in small hobby compilers like small- C .

Syntax Analysis : The lexemes must be arranged in such form that reflects the syntactic structure of the language . This is called syntax analysis or parsing . In this tutorial we will discuss only a form of top down parsing known as recursive descent parsing with one look ahead . Every programming language has a grammar which is usually represented in BNF form . A CFG grammar of a programming language is defined by the following

1. A set of terminal symbols that cannot be substituted

2. A set of non terminals which can be substituted by terminal or other non terminals

3. A set of Productions that define the grammar (of the form NT -> A | B , ie LHS should have

only one nonterminal)

4. A start symbol from which everything begins

Eg A grammar for the string aaaa ..any number of times is

S -> “a “ S { terminals are enclosed in “ “ and non terminals in capital letters }

{ | used to denote OR }

S -> nothing

ie this means that S can be an “a” followed by S itself or S can be nothing .

Lets see how aaa is derived

S->aS => S->aaS ( using S -> aS) => S->aaaS(using S->aS) => S->aaa (using S-> nothing_left) .

Now lets see how a recursive descent parser is constructed for this grammar. This is easily done by replacing each non terminal by function call and performing a match for the terminal

Void parseS()

{

match_current_input_token_for(“a”);

if( not_matched) error();

if(end_of_input) return ;

parseS();

}

In order to construct a recursive descent parser like the above the grammar should follow these

conditions .

For each production of the form S – > A | C | E .

The first sets of each non-terminal in each of this production must be disjoint

ie FIRST (A) intersection FIRST (C) intersection FIRST(E) = NULL

The first sets and the follow sets of a nonterminal must be disjoint ie …

FIRST(A) intersection FOLLOW (A) = NULL

FIRST(C) intersection FOLLOW ( C ) = NULL

…so on …

This implies that the first set should not be equal to the follow set of the non- terminal.

The the first set of a terminal is the symbol itself. The first set of the non – terminal is the set of terminals with which the nonterminal can begin with . The follow set of a non terminal is the set of terminals which can come after the non terminal. Lets find out the First n Follow for the following

grammar .

S -> “a”A | C D

A->”b”A |”c”

C -> “e”|nothing

D->”d” | “g” D

FIRST(S) = FIRST ( “a”A) union FIRST(CD)

= “a” union FIRST( C )

= “a” union FIRST ( C ) union FIRST (D) { since C can be nothing , first of D must be

included}

= “a” union FIRST ( “e” ) union FIRST (“d”) union FIRST (“g”)

= “a” union “e” union “d” union “g”

= { a , e , d , g }

Similarly FOLLOW(C) = FIRST(D) = { d , g } etc … Computing the first and follow of the rest of the symbols in the grammar is left as an exercise to you . These rules have to followed in order to avoid conflicts in the predictive parsing table. More about advanced parsing methods in later chapters .

Now i present a more formal algorithms for finding first and follow .is given below . You might find it easier to follow the informal definitions that the algorithms given below.

Finding all nothingable non terminals

1) if given A -> nothing the set NOTINGABLE(A) = true

2) while no changes made in NOTINGABLE table

3) if production such that A -> A1 A2 A3 … Ak and all of NOTHINGABLE(A1)

NOTHINGABLE(A2) … NOTHINGABLE(Ak) = true

4) then NOTINGABLE(A) = true

Finding first

1) if terminal then set FIRST (Symbol) = symbol

2) while no changes made in FIRST table

3) if production such that A -> A1 A2 A3 … Ak

4) FIRST(A) = FIRST(A) union FIRST (A1)

5) while (NOTHINGABLE(A1 ) , NOTHINGABLE(A2) .. NOTHINGABLE(Aj). = TRUE)

6) FIRST(A) = FIRST(A) U FIRST(A1) U FIRST (A2) U FIRST (A3) …

FIRST(Aj+1)

Finding follow

1) follow is not defined for terminals

2) follow of the Start symbol is the end of input marker

3) if there exits a nonterminal A such that B -> E A D then

4) add FIRST(D) to FOLLOW(A)

5) if there exists a production such that A -> G H such that NOTHINGABLE(H) = true then

6) everything in FOLLOW (A) is also in FOLLOW (H)

As a final example, grammar for if statement.

S -> IFCONDTION

IFCONDITON -> “if” EXPRESSION “then” STATEMENT

EXPRESSION -> TERM “+” TERM

TERM -> FACTOR *FACTOR

FACTOR -> id

EXPRESSION -> “(“ EXPRESSION “)”

STATEMENT -> id “;” STATEMENT | nothing

It can be seen that right recursion in many cases is unneeded and can be removed using a simple loop .

As right recursion leads to tail recursion. Eg

The grammar S ->aS | nothing can be more compactly written as S-> (a)*

The parser reduces to

parseS()

{

while(current_token == “a” ) advance_input_pointer();

}

Left recursion will lead to an infinite loop and should be removed .For eg the grammar

S -> S B | D

can be re written as

S -> D S_DASH

S_DASH -> B S_DASH

Common factor in grammars should also be removed eg for the grammar

S ->” if” condition “then” statement

S -> “if” condition “then” statement “else” statement

can be modified into

S->AB

A -> ” if” condition “then” statement

B -> “else” statement | nothing

Semantic Analysis : Syntax analysis does not tell anything about the meaning of the statement under

consideration . It simply check whether the syntax is correct . For eg

const int i = 20;

i++;

The above lines are syntactically correct but semantically wrong . The semantics are sometimes given

using attribute grammars . .More about this later.

Code Generation : This is the final phase where the assembly code to the target machine is generated . I have skipped many other phases which are part of a more mature compiler. The purpose of this tutorial is to give quick introduction . The most of the hobby compilers the major component is the parser and code generation and semantic analysis is done when each non terminal is encountered in the parser.

As an example of developing a simple interpreter , I have provided the source code of a simple basic like interpreter. I have used the mingw port of gcc for compiling the source . I used dev-cpp as the ide .

The corresponding project and source files are also submitted .
Download the source code in pdf format.clicky , Basically , what you have to do is copy paste the code and compile it with a modern C++ compiler .