Interlingua MT: Translation of Numbers
Topics:
Number systems
Grammar for numbers
Parsing
Semantic processing
Generation
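MT “pyramid” (revisited)
[Figure: the MT pyramid, with the source language on one side, the target language on the other, and the interlingua at the apex. Levels from base to apex: direct translation (word-for-word translation); transfer at phrase structure; transfer at functional structure; transfer at semantic representation; transfer at a deeper representation.]
No transfer process is needed for interlingua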
*
Interlingua MT
[Figure: Interlingua at the center, linked to Language1, Language2, Language3, and others.]
Advantage of interlingua: adding a new language needs only one more language pair: new language ↔ Interlingua
*
What is interlingua?
An interlingua is supposed to be a universal
representation for … What?
meaning, of course
but what is meaning?
Since “meaning” itself has no clear definition, we may describe an interlingua as
a universal representation of what can be conveyed
through human language communication
Question:
What can be conveyed by our languages?
How to design an interlingua?
Any clear idea about it? No
What we are sure to know is its
universality and
versatility
Think about the following
ontology of human knowledge
conceptions of what we know and can express via speech
ontology of objects in the world and in our languages
ontology of events
ontology of words, etc.
Any example to help us understand it any better?
*
Interlingua MT for numbers
[Figure: Interlingua (values) at the center, linked to English numbers, Chinese numbers, and others.]
Interlingua: values
Arabic numbers: used as the universal representation for values
Number systems
Decimal numbers
Arabic numbers – yes
Chinese numbers – ?
• <= 10,000, yes
• > 10,000, still?
English numbers – ?
• <= 1,000, yes
• > 1,000, still?
Basically yes, but with quite some variation!
What is the difference?
The difference between the systems shows up in the difficulty of converting or translating between them.
*
What defines a number system?
Base
the set of digits (or base symbols) used
the cardinality of the digit set (i.e., the number of digits)
decimal numbers
base 10
digits: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
each digit has its own digit value.
Position
the place where a digit shows up, e.g., positions 4 3 2 1 0 for the digits 2 3 3 8 8
each position has its position value: $|\mathrm{Base}|^{\mathrm{Pos}}$
What value does a digit represent?
Digit:     2  3  3  8  8
Position:  4  3  2  1  0
Value = digit value × position value = $\mathrm{Digit} \times |\mathrm{Base}|^{\mathrm{Pos}}$:
$2\times10^4$, $3\times10^3$, $3\times10^2$, $8\times10^1$, $8\times10^0$
A digit represents a different value when it shows up in a different position
What is the value of a number?
A number’s value = the sum of all its digits’ values. E.g.,
$23{,}388 = 2\times10^4 + 3\times10^3 + 3\times10^2 + 8\times10^1 + 8\times10^0 = 23{,}388$
Hey! So trivial!
What kind of game are you playing?
Let us play with binary numbers
Base 2
Digits = {0, 1} (i.e., only 0 and 1 appear in a number)
$11111_2 = 1\times2^4 + 1\times2^3 + 1\times2^2 + 1\times2^1 + 1\times2^0 = 31_{10}$
Still trivial?
All computers play such a game.
How about numbers in other bases?
*
Octal numbers
Base 8
Digits = {0, 1, 2, 3, 4, 5, 6, 7}
• Numbers:
0, 1, 2, 3, 4, 5, 6, 7,
10, 11, 12, 13, 14, 15, 16, 17,
20, 21, 22, 23, 24, 25, 26, 27,
30, 31, 32, 33, 34, 35, 36, 37, ……
• $357_8 = ?$
$= 3\times8^2 + 5\times8^1 + 7\times8^0 = 239_{10}$
*
Hexadecimal numbers
Base 16
Digits = {0, 1, 2, …, 9, A, B, C, D, E, F}
• Numbers:
0, 1, 2, … 9, A, B, C, D, E, F
10, 11, 12, … 19, 1A, 1B, 1C, 1D, 1E, 1F
20, 21, 22, … 29, 2A, 2B, 2C, 2D, 2E, 2F
……
• $357_{16} = ?$
$= 3\times16^2 + 5\times16^1 + 7\times16^0 = 855_{10}$
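The binary, octal, and hexadecimal examples all apply the same rule: a number's value is the sum of digit value × position value. A minimal Python sketch of that rule follows; the function name `positional_value` and the digit alphabet (0-9, A-F) are assumptions made for illustration, not part of the lecture.

```python
def positional_value(digits: str, base: int) -> int:
    """Sum digit * base**position over all digits; position 0 is the rightmost."""
    symbols = "0123456789ABCDEF"          # assumed digit alphabet, enough for bases up to 16
    total = 0
    for pos, ch in enumerate(reversed(digits.upper())):
        digit = symbols.index(ch)         # digit value (raises ValueError if unknown)
        if digit >= base:
            raise ValueError(f"{ch!r} is not a base-{base} digit")
        total += digit * base ** pos      # digit value x position value
    return total


print(positional_value("11111", 2))   # binary example
print(positional_value("357", 8))     # octal example
print(positional_value("357", 16))    # hexadecimal example
```

The three calls reproduce the worked examples on the slides (31, 239, and 855 in decimal).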
*
Chinese numbers
Base 10, basically
• Digits = {零, 一, 二, 三, 四, 五, 六, 七, 八, 九}
Another set of digits: {壹, 貳, 叁…, 玖}
Position
Positions in Chinese numbers are explicitly expressed
• Positions: {個}, 十拾, 百佰, 千仟, 万萬, 亿億, 兆
• Position values: $1, 10, 10^2, 10^3, 10^4, 10^8, 10^{12}$
• E.g.,
五 千 六 百 七 十 八
$= 5\times10^3 + 6\times10^2 + 7\times10^1 + 8\times10^0$
$= 5{,}678_{10}$
A grammar for Chinese numbers
G –> Digits
S –> {G} 十 {G}
B –> G 百
B –> G 百 S
B –> G 百 零 G
Q –> G 千
Q –> G 千 B
Q –> G 千 零 S
Q –> G 千 零 G
W –> Q/B/S/G 萬
W –> Q/B/S/G 萬 Q
W –> Q/B/S/G 萬 零 B
W –> Q/B/S/G 萬 零 S
W –> Q/B/S/G 萬 零 G
Note on 零 in these rules: conjunction, not zero!
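The rules for numbers below 10,000 (G, S, B, Q) are small enough to try out directly. The sketch below turns them into regular expressions, one per non-terminal. It is only an illustration: the name `is_number_below_10k` is hypothetical, and it assumes that G ranges over 一 to 九 only (零 is treated purely as the conjunction).

```python
import re

G = "[一二三四五六七八九]"            # assumption: 零 is excluded from G here
S = f"{G}?十{G}?"                    # S –> {G} 十 {G}
B = f"{G}百(?:{S}|零{G})?"           # B –> G 百 | G 百 S | G 百 零 G
Q = f"{G}千(?:{B}|零{S}|零{G})?"     # Q –> G 千 | G 千 B | G 千 零 S | G 千 零 G
UNDER_10K = re.compile(f"(?:{Q}|{B}|{S}|{G})")


def is_number_below_10k(text: str) -> bool:
    """True if the whole string is derivable as a Q, B, S, or G phrase."""
    return UNDER_10K.fullmatch(text) is not None


print(is_number_below_10k("五千六百七十八"))  # Q –> G 千 B
print(is_number_below_10k("三千零六"))        # Q –> G 千 零 G
print(is_number_below_10k("三千六"))          # no rule allows a bare G after 千
```

A regular expression suffices here only because these four rules do not recurse; it recognizes strings but builds no parse tree.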
*
Large numbers in Chinese
W –> Q/B/S/G 萬
W –> Q/B/S/G 萬 Q
W –> Q/B/S/G 萬 零 B
W –> Q/B/S/G 萬 零 S
W –> Q/B/S/G 萬 零 G
Y –> Q/B/S/G 億
Y –> Q/B/S/G 億 W
Y –> Q/B/S/G 億 零 W
Y –> Q/B/S/G 億 零 Q
Y –> Q/B/S/G 億 零 B
Y –> Q/B/S/G 億 零 G
Z –> Q/B/S/G 兆
Z –> Q/B/S/G 兆 Y
Z –> Q/B/S/G 兆 零 Y
Z –> Q/B/S/G 兆 零 W
Z –> Q/B/S/G 兆 零 Q
Z –> Q/B/S/G 兆 零 B
Z –> Q/B/S/G 兆 零 G
Problem:
Ambiguity in analysis
Solution
W –> B/S/G 萬
W –> B/S/G 萬 Q
W –> B/S/G 萬 零 B
W –> B/S/G 萬 零 S
W –> B/S/G 萬 零 G
WQ –> Q 萬
WQ –> Q 萬 Q
WQ –> Q 萬 零 B
WQ –> Q 萬 零 S
WQ –> Q 萬 零 G
Y –> B/S/G 億
Y –> B/S/G 億 WQ
Y –> B/S/G 億 零 W
Y –> B/S/G 億 零 Q
Y –> B/S/G 億 零 B
Y –> B/S/G 億 零 G
YQ –> Q 億
YQ –> Q 億 WQ
YQ –> Q 億 零 W
YQ –> Q 億 零 Q
YQ –> Q 億 零 B
YQ –> Q 億 零 G
Z –> Q/B/S/G 兆
Z –> Q/B/S/G 兆 YQ
Z –> Q/B/S/G 兆 零 Y
Z –> Q/B/S/G 兆 零 W
Z –> Q/B/S/G 兆 零 Q
Z –> Q/B/S/G 兆 零 B
Z –> Q/B/S/G 兆 零 G
*
Chinese numbers => values
Two steps:
Syntactic analysis
Parsing: to derive a syntactic tree (called parse tree) for an input sentence / number.
Result: a phrase structure tree.
Semantic interpretation:
To convert the parse tree
into a semantic / meaning representation,
namely, a value.
Semantic rules for interpretation
We need to define a semantic rule for each grammar rule to specify
how a phrase structure under the grammar rule is
interpreted into a meaning representation, i.e.,
how to convert a syntactic structure into meaning.
E.g., for the grammar rule Z –> Q/B/S/G 兆 零 Y:
sem(Z) = sem(Q/B/S/G 兆 零 Y) = sem(Q/B/S/G) × sem(兆) + sem(Y)
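The pattern behind this semantic rule, sem(head) × sem(position) + sem(tail), can be sketched in Python. This is a simplified illustration, not the parser from the lecture: it splits the string at the largest position symbol instead of building a parse tree from the grammar, so it does not reject ungrammatical input. The table names `DIGITS` and `POSITIONS` are assumptions.

```python
DIGITS = {ch: i for i, ch in enumerate("零一二三四五六七八九")}
POSITIONS = [("兆", 10**12), ("億", 10**8), ("萬", 10**4),
             ("千", 10**3), ("百", 10**2), ("十", 10)]


def sem(text: str) -> int:
    """Value of a Chinese number: sem(head) * sem(position) + sem(tail)."""
    text = text.lstrip("零")                 # leading 零 is a conjunction and adds no value
    if not text:
        return 0
    for symbol, pos_value in POSITIONS:      # try the largest position first
        if symbol in text:
            head, tail = text.split(symbol, 1)
            head_value = sem(head) if head else 1   # a bare 十 counts as 1 × 10
            return head_value * pos_value + sem(tail)
    if len(text) == 1 and text in DIGITS:
        return DIGITS[text]                  # G –> Digits
    raise ValueError(f"cannot interpret {text!r}")


print(sem("三千兆零六百億"))
print(sem("五千六百七十八"))
```

Trying the largest position first mirrors the way Z, Y, W, Q, B, and S nest in the grammar, so the head of each split is always a smaller number than the position it multiplies.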
*
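Example: parsing
Input: 三 千 兆 零 六 百 億
Parse tree (each node is shown with the operations of its semantic rule):
Z –> Q 兆 零 Y   (×, +)
    Q –> G 千   (×)
        G –> 三
    Y –> B 億   (×)
        B –> G 百   (×)
            G –> 六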
Example: semantic interpretation
Input: 三 千 兆 零 六 百 億
G = 3, 千 = $10^3$, so Q = $3\times10^3$
兆 = $10^{12}$, so Q 兆 = $3\times10^{15}$
G = 6, 百 = $10^2$, so B = $6\times10^2$
億 = $10^8$, so Y = $6\times10^{10}$
Z = $3\times10^{15} + 6\times10^{10}$ = 3,000,060,000,000,000
*
Generation (i): Head
Given an Arabic number, generate its Chinese counterpart
Format: N = head × pos + tail
Denoted as: head(N, pos) and tail(N, pos), respectively
Given an input number X, how do we generate it? Heads and then tails
head(X, 兆|億|萬) = X / $10^{12|8|4}$ (integer division!)
tail(X, 兆|億|萬) = X % $10^{12|8|4}$ (remainder!)
Generate its head
gen(head(X, 兆|億|萬)), a Q-number < $10^4$
Generate its tail
gen(tail(X, 兆|億|萬)), a number < $10^{12|8|4}$
*
Generation (ii): Tail < $10^4$
Generating a Q-number X < $10^4$
head(X, 千/百/十) = X / $10^{3|2|1}$
tail(X, 千/百/十) = X % $10^{3|2|1}$
Generate its head
gen(head(X, 千/百/十)), a single digit (< 10)
Generate its tail
gen(tail(X, 千/百/十)), a number < $10^{3|2|1}$
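The head and tail definitions are just integer division and remainder by the position value. A small Python sketch follows, assuming a lookup table `POS` from position symbol to value (the table and the function signatures are illustrative, not taken from the slides).

```python
POS = {"兆": 10**12, "億": 10**8, "萬": 10**4,
       "千": 10**3, "百": 10**2, "十": 10}


def head(x: int, pos: str) -> int:
    """Integer division by the position value."""
    return x // POS[pos]


def tail(x: int, pos: str) -> int:
    """Remainder after dividing by the position value."""
    return x % POS[pos]


x = 123_456_789_123_456_789
print(head(x, "兆"), tail(x, "兆"))
# The decomposition is lossless: x == head * pos + tail
print(head(x, "兆") * POS["兆"] + tail(x, "兆") == x)
```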
Generation: example
1. X = 123,456,789,123,456,789
gen(X) = gen(head(X,兆)) 兆 gen(tail(X,兆))
= gen(123,456) 兆 gen(789,123,456,789)
2. X = 123,456
gen(X) = gen(head(X,萬)) 萬 gen(tail(X,萬))
= gen(12) 萬 gen(3,456)
3. X = 12
gen(X) = gen(head(X,十)) 十 gen(tail(X,十))
= gen(1) 十 gen(2)
4. gen(1) = 一
gen(2) = 二
*
Example: generation of conjunction
gen(3,000,060,000,000,000), pos = 兆:
head = 3000
tail = 60,000,000,000
gen(3000), pos = 千:
head = 3 (三)
tail = 0
output: 三千
gen(60,000,000,000), pos = 億:
head = 600
tail = 0
output: 六百億
gen(600), pos = 百:
head = 6 (六)
tail = 0
output: 六百
Result: 三千兆零六百億
For Chinese, whenever the tail is nonzero but less than 1/10 of pos, insert the conjunction 零 into the output.
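Putting the head/tail recursion and the 零 rule together gives a compact generator. The Python sketch below is one possible reading of the procedure, not a reference implementation: it assumes a non-negative integer input, always picks the largest position that fits, and writes 12 as 一十二, following the slide's gen(1) 十 gen(2).

```python
DIGITS = "零一二三四五六七八九"
POSITIONS = [("兆", 10**12), ("億", 10**8), ("萬", 10**4),
             ("千", 10**3), ("百", 10**2), ("十", 10)]


def gen(x: int) -> str:
    """Generate the Chinese number for a non-negative integer x."""
    if x < 0:
        raise ValueError("only non-negative integers are handled in this sketch")
    if x < 10:
        return DIGITS[x]                     # a single digit
    for symbol, pos in POSITIONS:            # largest position first
        if x >= pos:
            head, tail = divmod(x, pos)      # integer division and remainder
            out = gen(head) + symbol
            if tail == 0:
                return out                   # nothing left to generate
            if tail < pos // 10:             # tail skips at least one position
                out += "零"                  # insert the conjunction
            return out + gen(tail)
    raise AssertionError("unreachable: x >= 10 always matches 十")


print(gen(3_000_060_000_000_000))
print(gen(5678))
print(gen(203))
```

The first call reproduces the example above (三千兆零六百億). The check `tail == 0` comes before the 零 test so that numbers such as 3000 do not end in a dangling 零.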
English part?
[Figure: Interlingua (values) at the center, linked to Arabic numbers, Chinese numbers, English numbers, and others; the English part is marked ???.]
*
Grammar for English numbers
Exercise
Design a grammar for English numbers, covering the range [0, 1,000,000,000,000,000-1], and
Use it to analyse the English number for
123,456,789,123 (or 123,456,789,123,456)
Design the generation procedure for English numbers and illustrate how it works for a real English number, e.g., 123,456.
*
Hints
In the lecture on interlingua, the following was given as the starting point for designing the grammar for English numbers for HW2:
D0 –> {zero}
D –> {one, two, …, nine}
D’ –> {ten, eleven, … nineteen}
T –> {twenty, …ninety}
and then five rules: H- –> D0 | D | D’ | T | T D
subsuming the following:
H- –> D0
H- –> D
H- –> D’
H- –> T
H- –> T D
to cover numbers under 100. (Do not add any extra symbol in a rule such as T –> D + D, which is wrong!)
Do not forget N –> H-, for N is our “axiom” (just like S for sentence). So are rules for Th-, M-, B-, etc.
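To make the five H- rules concrete, here is a small Python sketch that interprets a phrase under 100 by checking which rule applies. It covers only the starting-point rules given in these hints, not the rest of the exercise. The names `sem_h` and `D_PRIME`, and the assumption that words are separated by spaces (no hyphens), are illustrative.

```python
D0 = {"zero": 0}
D = {w: i for i, w in enumerate(
    "one two three four five six seven eight nine".split(), start=1)}
D_PRIME = {w: i for i, w in enumerate(
    "ten eleven twelve thirteen fourteen fifteen sixteen "
    "seventeen eighteen nineteen".split(), start=10)}
T = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}


def sem_h(phrase: str) -> int:
    """Value of an H- phrase: H- –> D0 | D | D’ | T | T D."""
    words = phrase.split()
    if len(words) == 1:
        for table in (D0, D, D_PRIME, T):    # H- –> D0, D, D’, T
            if words[0] in table:
                return table[words[0]]
    elif len(words) == 2 and words[0] in T and words[1] in D:
        return T[words[0]] + D[words[1]]     # H- –> T D
    raise ValueError(f"not an H- phrase: {phrase!r}")


print(sem_h("forty two"))
print(sem_h("thirteen"))
```

A phrase such as "ten two" is rejected because no rule combines D’ with D, which is the kind of coverage check the hints ask for.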
Following the above fashion, we can have Th- –> D hundred {H-} for numbers in the range [100, 999]. As mentioned in class, people actually say “twenty hundred” and even “ninety nine hundred”, so we can extend this rule into the following by replacing D with H-:
Th- –> H- hundred
Th- –> H- hundred H-
Th- –> H- hundred and H- (For British English)
Originally, Th- was defined to cover [100, 999]. Because H- covers more than D does, the Th- rules overgenerate: they also produce numbers beyond 999. Conceptually, though, it is fine for the other rules to treat Th- as covering numbers under 1000.
You may merge them into one line (NOT one rule!) as:
Th- –> H- hundred {and} {H-}
where {} means optional. Please check whether any number in the range [100, 999] is missing before moving on to the rules for M-, B-, etc.
Continued from the previous page:
M- –> [ ] thousand
M- –> Th- thousand Th-
M- –> Th- thousand and H-
M- –> Th- thousand H-
Consolidated as: M- –> Th- billion
…..
[…]billion […]million […]thousand […]
Notes:
Replace the arrows with the standard arrow symbol
Chinese number gen(23)
=gen(2,十)+gen(3)
=gen(2)+gen(3)
English number gen(23)
=gen(2,tens)gen(3)
=twenty gen(3)
gen(19) produces 19 directly
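The English side of this note can be sketched for numbers under 100: the tens digit maps to a tens word, the ones digit is generated separately, and numbers below 20 are produced directly. The function name `gen_en` and the word tables are assumptions for illustration; larger numbers are left to the exercise.

```python
ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty",
        6: "sixty", 7: "seventy", 8: "eighty", 9: "ninety"}


def gen_en(x: int) -> str:
    """Generate the English number for 0 <= x < 100."""
    if not 0 <= x < 100:
        raise ValueError("this sketch only covers numbers under 100")
    if x < 20:
        return ONES[x]                       # e.g. gen(19) is produced directly
    tens, ones = divmod(x, 10)
    if ones == 0:
        return TENS[tens]                    # e.g. forty
    return f"{TENS[tens]} {ONES[ones]}"      # gen(2, tens) gen(3) = twenty three


print(gen_en(23))
print(gen_en(19))
```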