Tuesday, June 30, 2015

Do we have real processing of natural language?

These notes are, in a way, a response to John Ball's article "Blame Chomsky for non-speaking A.I." (http://www.computerworld.com/article/2929085/emerging-technology/blame-chomsky-for-non-speaking-ai.html). Like John Ball, I am frustrated by the way current AI resources are spent, still without leading to systems able to use natural language more or less the way humans do. But my background and history are different, so I have a different perspective; and of course, it is not about placing the blame.

For me, the groundbreaking achievement of Chomsky's phrase-structure grammars was that natural language could be described in mathematically tractable terms, instead of being seen as something vague and unruly. This defined my transition into the field from mathematical logic. So I became interested in computer processing of natural-language texts, though I didn't follow the path of what seemed to be the mainstream research. Context-free grammars are very close to what the language actually has, but not an exact fit, so some modification is in order -- though not through context-sensitive grammars or another mathematically related class. The increased interest in Chomsky's hierarchy is, for me, an example of research inertia. Besides, I worked mostly with Russian, for which dependency grammars seem more appropriate than phrase-structure grammars. The principal obstacle being detection of the syntactic structure of the text, I ended up with the approach known as the "filter method": describe the syntax as a set of logical restrictions on the syntax tree, and for a given sentence look for the tree that satisfies those restrictions.
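
To make the idea concrete, here is a toy sketch (not from my original system; the sentence, parts of speech, and restrictions are all invented for the example) of a parse by the filter method: every assignment of heads is a candidate tree, and the grammar is nothing but a set of yes/no restrictions that filter the candidates.

```python
from itertools import product

# Toy sketch of the "filter method": each word gets a candidate head
# (0 stands for an artificial root), and the grammar is just a set of
# yes/no restrictions on the whole assignment.
pos = {1: "PRON", 2: "VERB", 3: "NOUN"}       # "she reads books"
positions = list(pos)                         # [1, 2, 3]

def is_tree(heads):
    """Exactly one word hangs from the root, and the head links contain no cycle."""
    if list(heads.values()).count(0) != 1:
        return False
    for w in positions:
        seen, cur = set(), w
        while cur != 0:
            if cur in seen:
                return False
            seen.add(cur)
            cur = heads[cur]
    return True

def restrictions_ok(heads):
    """Sample restrictions of the kind a filter grammar states declaratively."""
    for dep, head in heads.items():
        if dep == head:                           # a word cannot govern itself
            return False
        if head == 0 and pos[dep] != "VERB":      # only a verb may be the root
            return False
        if head != 0 and pos[head] == "PRON":     # a pronoun takes no dependents
            return False
    return True

parses = []
for combo in product([0] + positions, repeat=len(positions)):
    heads = dict(zip(positions, combo))           # one candidate dependency tree
    if restrictions_ok(heads) and is_tree(heads):
        parses.append(heads)

print(parses)   # two trees survive: the intended one, and one where "she"
                # is attached to "books" -- ambiguity that syntax alone
                # cannot resolve
```

Already in this three-word example the restrictions leave two trees; real sentences leave many more, which is where the combinatorial explosion mentioned below comes from.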

As is known, this easily leads to a combinatorial explosion, and I tried to work around the problem (I even built a primitive SAT-solver). But being constrained in computer resources, I had to abandon this approach and look for an alternative (after all, I couldn't believe that a human brain, which parses a sentence easily, relies on huge computational resources). Later, though, I got to see where the journey I had abandoned ends: an elaborate grammar (HPSG) based on a big text corpus, plus a huge SAT-solver, producing lots of alternative parses, most of them making no sense -- semantics was not used. All very nice, but what next?

Next, the meaning is still the goal, even if we don't know exactly what it is. We have had some successful systems showing understanding of meaning in a restricted domain, like T. Winograd's SHRDLU. I don't think they were very appealing to linguists, whose ambition is usually to give as comprehensive a description as possible of the whole language, not of a specific narrow domain. But to have a deep semantic theory for the whole language would mean to have a theory of everything the language can express, i.e., of everything we know. So I think we should build meaning models for restricted domains, eventually extending and/or coalescing them, probably with a methodology for conquering new domains. Unfortunately, I don't see this happening.

The next chapter comes with the interest of big business in "text analytics" or "handling unstructured data". The big bosses don't care about the subtleties of grammar, believing that their money and computing resources will trump all. Text processing once again becomes mostly keyword-based, using every shortcut (like "bags of words") to avoid deep linguistic analysis. And they proudly call this NLP! This has also produced new disciplines like "named entity recognition", which do carry some semantics, but little else.

Now forget NLP, there is NLU (natural language understanding)! Personal assistants in handheld devices use the power of speech recognition... to call their pre-existing apps. You say a command, and the system figures out which app to call and with which arguments. But I don't see any point in using natural language to talk to an old-fashioned app; the language promises much more. I imagined an airline reservation system to which the customer says at some point: "I would accept this flight unless you have an earlier one". And the system replies: "We do have an earlier flight but it lands at a different airport". Once I gave this example to a developer of a personal assistant and was told that this would be too complicated a "use case". If the system were able to process parts of sentences meaningfully, rather than look for a predefined use case, it would know how to handle such situations.

So we are not yet there, and probably not moving that way.

I agree with John Ball that we need a more refined description of syntactic links. Indeed, when I worked with the HPSG parser mentioned above, I added some semantics-based filtering of the parser outputs and also tried to obtain more details about the syntactic roles, but this was not part of the project. If I could apply semantic filtering before the call to the SAT-solver, it would take away much of the SAT-solver's complexity and time. This also suggests that linguistic analysis doesn't need to follow the "levels" (first morphology, then syntax, then semantics, ...) but can instead use information from different levels concurrently.
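
As a hypothetical illustration (the lexicon and the semantic classes are invented), this is the kind of semantic pre-filtering I mean: candidate links that make no semantic sense are discarded before the constraint problem is ever built, instead of being weeded out of the solver's output afterwards.

```python
# Hypothetical illustration of semantic filtering applied *before* the
# combinatorial search rather than to its outputs.  The lexicon and the
# semantic classes are invented for the example.
takes_object = {"drink": {"liquid"}, "read": {"text"}}
semclass = {"coffee": "liquid", "novel": "text", "silence": "abstract"}

def plausible(verb: str, noun: str) -> bool:
    """A candidate verb-object link survives only if the noun's semantic
    class is one the verb accepts."""
    return semclass.get(noun) in takes_object.get(verb, set())

candidates = [(v, n) for v in takes_object for n in semclass]
pruned = [link for link in candidates if plausible(*link)]

print(len(candidates), "candidate verb-object links before semantic filtering")
print(len(pruned), "after:", pruned)   # [('drink', 'coffee'), ('read', 'novel')]
```

With the implausible links gone, the constraint problem handed to the solver is far smaller.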

I have had little time to study the Patom model, and have had only a superficial glimpse of RRG (Role and Reference Grammar). My own tools and observations from previous research can be found on www.academia.edu.

Gregory Tseytin (more about me is on LinkedIn)

Tuesday, August 14, 2012

TOWARDS A SYSTEM THAT REALLY UNDERSTANDS NATURAL LANGUAGE

Recent years have seen substantial progress in natural language processing, including practical language translation based on statistical methods and phones with spoken input in natural language. But the weaknesses of these systems are also well known. The translation engine uses combinations of words and their counterparts in the other language mostly correctly, but sometimes produces embarrassing errors because it doesn't try to understand the texts. And spoken input to phones is mostly regarded as a gateway to applications already existing in the phone, where understanding is restricted to the needs of each particular application.

A number of known techniques are usually listed as the required "skills" of natural language processing, like named entity recognition or co-reference resolution. However, most of these techniques are regarded as imprecise tools that are inherently subject to errors. By extension, the whole field of natural language processing is considered imprecise and incapable of reaching reliable conclusions or decisions on important issues.

But why? Why is it that we can make the most precise statements in law or science, statements that humans can interpret without such fears, yet cannot extend this level of precision to computer applications? My explanation is that each computer application takes only a slice of the language, whereas people use the complete language. We have elaborate and nearly complete computer dictionaries, even showing relationships between words, and nearly complete grammars (probably not covering some colloquial abbreviations). But neither of them truly goes into the meaning, except for some well-described semantic relationships. In the academic tradition, one strives to describe the language as completely as possible, but this cannot extend to real-world meanings, because then we would have to describe everything in the world for which the language is used.

And this creates a barrier for the abilities of "rule-based" NLP. A well-developed parser can thoroughly analyze a complex sentence but produce a great number of possible interpretations that are unexpected and absurd to a human reader who understands the sentence; in addition, it spends a lot of time and memory processing those spurious interpretations. And even given the whole list of possible parses, we cannot do much to automatically select the one originally intended.

Is there a way to develop an integrated approach to natural language that achieves adequate understanding of the texts? We cannot expect (at least at this time) to build a full model of the world known to humans, but we can start with smaller domains or applications. Would this be different from current phone apps? I think it must be different. If the application (probably developed before natural language interfaces) is too narrow-minded (e.g., it just expects arguments for a specific function call), there is no point in speaking to it in natural language; menus and forms with fields would serve better. The application itself must be rethought to warrant the use of natural language.

I recall a recent discussion with the CEO of a startup who planned to outperform the latest systems like Siri. He was absolutely convinced that what a language processor should do was, first, identify the application to be called (give him a method!) and then extract the arguments for the call. The only thing that troubled him was that the speaker could deviate from the expected response. To me, the problem was simply to understand what was said each time. I suggested a case, for a flight reservation system, in which after some discussion the customer says "In case you don't have an earlier flight I will accept this one", upon which the system responds "We do have an earlier flight but it arrives at a different airport". I saw no problem in understanding the meaning of the customer's request by connecting the components of the request to the appropriate database entities. But to the CEO it was absolutely outlandish, and we never talked again.
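
A minimal sketch of what I mean by connecting the components of the request to database entities; the flight records, field names, and the parsed form of the utterance are all invented for the example.

```python
from dataclasses import dataclass

# Hypothetical illustration: the customer's conditional request is mapped
# onto entities in a tiny flight table, not onto a predefined "use case".
@dataclass
class Flight:
    number: str
    departs: int      # minutes after midnight
    airport: str      # arrival airport

flights = [Flight("AB100", 9 * 60, "JFK"), Flight("AB102", 13 * 60, "EWR")]
offered = flights[1]                      # the flight currently under discussion

# "In case you don't have an earlier flight I will accept this one."
# Parsed into: accept(offered) unless there exists f with f.departs < offered.departs.
earlier = [f for f in flights if f.departs < offered.departs]

if not earlier:
    print(f"Booking {offered.number}.")
else:
    alt = min(earlier, key=lambda f: f.departs)
    if alt.airport != offered.airport:
        print(f"We do have an earlier flight ({alt.number}), "
              f"but it arrives at {alt.airport}, a different airport.")
    else:
        print(f"We have an earlier flight: {alt.number}. Shall I book it?")
```

The point is not this particular toy logic but that each component of the utterance lands on a database entity, so even an unforeseen combination of components still gets a sensible answer.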

What can we do for a system that really understands the language? We need to select an application domain broad enough to warrant the use of natural language but still representable in a computer. So we are still slicing natural language, but not by types of linguistic phenomena (vocabulary, syntax, etc.). Over time, smaller domains can coalesce.

Selecting a domain and providing sample dialogs might not be an easy task, and indeed it may take as much ingenuity as developing a new human interface. In my old research (see The Prague Bulletin of Mathematical Linguistics, 65-66, pp.5-12 (1996), or http://www.math.spbu.ru/user/tseytin/mytalke.html) I used (like many other researchers) school problems about moving objects, and in trying to connect linguistic entities with mathematical models I got a number of useful insights. (Authors of school problems are often particularly inventive in squeezing complex mathematical dependencies into concise descriptions.)

We will still need a parser, but not necessarily of the usual type. Once we identify a probable syntactic entity, we need to check immediately whether it makes sense in our problem domain, and so avoid spurious variants. Moreover, based on statistical techniques, we might be able to start with units bigger than words, for which we might have ready semantic counterparts. Once the parse is complete, we will immediately have a meaningful object in the problem domain and can proceed with its domain-specific processing.
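
A hypothetical sketch of such a parser-with-domain-check (the domain table, the phrases, and the greedy strategy are all made up for illustration): each multi-word unit is accepted only if the problem domain can interpret it, so spurious groupings never enter the parse.

```python
# Hypothetical sketch of checking each proposed unit against the problem
# domain right away, instead of enumerating complete parses first.
domain = {
    ("morning", "flight"): {"type": "flight", "period": "am"},
    ("next", "monday"):    {"type": "date", "weekday": "monday", "offset": 1},
    ("window", "seat"):    {"type": "seat", "position": "window"},
}

def interpret(tokens):
    """Greedy left-to-right pass: keep a multi-word unit only if the domain
    can interpret it; otherwise move on by a single word."""
    objects, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in domain:              # the unit makes sense in the domain
            objects.append(domain[pair])
            i += 2
        else:                           # no domain object for this grouping
            i += 1
    return objects

print(interpret("book a morning flight next monday".split()))
# [{'type': 'flight', 'period': 'am'},
#  {'type': 'date', 'weekday': 'monday', 'offset': 1}]
```

Anything the domain cannot interpret is dropped on the spot, so no spurious variant survives to the end of the parse.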

So, is there anyone to pursue this course?

Saturday, July 2, 2011

For English-only speakers

The post below this one was originally written in Esperanto; you can skip it and read what came before.

Has Esperanto really been defeated?

Here are a few of my impressions from the recent US Esperanto congress in Emeryville. Among the varied program items, some edifying, some amusing, some nourishing, I want to single out the discussions about the position of Esperanto in modern society (led by Professor Humphrey Tonkin) and Sam Green's documentary film "The Universal Language".

Professor Tonkin rightly noted that because of modern means of communication (the Internet, search engines, social networks), the role of Esperanto has changed profoundly; the old model of a hobbyists' circle no longer suffices, and new forms of Esperanto activity are needed and are already emerging. He noted with regret, however, that Esperanto lacks prestige (and he is right, at least with respect to his own milieu of high-ranking academics -- though a few years ago a young Esperantist complained of a different cause of Esperanto's low prestige among his schoolmates: the lack of rap music in Esperanto; but that last point was effectively disproved by Sam Green's film).

The film "The Universal Language" (made by a non-Esperantist) shows many successes of the language in various fields, but at the same time explains that all of this is sustained by a HOPE that runs counter to reality: the language has already been defeated, and the aspired-to role of the international language is already occupied by English. But before discussing possible future developments, let me ask: does Esperanto really compete with English? The goals and the functioning of the two languages are different, and each of them helps human communication on its own ground. So the two languages (together with all the others) in fact cooperate rather than compete. I really do not understand people with that competitive mindset.

People who happen to hear something about Esperanto often come to think that it is some new proposal to introduce a new language; and seeing that after a few months this has not happened, they conclude that the idea has died out. But we know that this is not a matter of a few months or years. Over a real history of more than 120 years Esperanto has not only not died but has achieved great successes. It lives on despite profound changes in society and in the dominant ideas; it lives on through the present globalization of the world. It did not perish during the world wars or under dictatorships of various kinds. Even within my own experience of more than fifty years I see many changes: much less botched Esperanto, and much less shallow, unimaginative "poetry" in Esperanto. I no longer hear the bigoted assertions of outsiders that Esperanto is dead from birth because it has no people of its own and is therefore only an artificial scheme without an inner "soul". Many pieces of computer software (not made specially for Esperantists), including the one I am using right now to type this message, have special capabilities for displaying Esperanto texts and communicating with the user in Esperanto. And its progress has not stopped.

What about the future, then? We cannot foresee the developments, but profound changes in language use caused by new technologies continue, following on from the changes already noted by Humphrey Tonkin. It is no longer enough merely to transmit messages in various languages verbatim. The need to analyze the content of texts automatically and to draw conclusions or make decisions on the basis of that content is becoming more and more urgent. Much is already being done to make such analysis possible, but unfortunately the efforts on the various fronts (by linguistic theorists, by the "semantic web", by ever-improving search engines, by "sentiment analysis", i.e., distinguishing positive from negative remarks about a company's products in social networks) remain scattered and are not easy to bring together. These processes will certainly change our language-use practices, even though it is not clear exactly how. (Do not expect, however, that the regularity of Esperanto morphology will give it an advantage in computer processing: lists of all the irregular word forms are easy for machines to store, and Esperanto grammar has long since outgrown the "16 rules" to become a "kilogrammar".)

And the more distant future, when presumably all of humanity will speak one language? Will it be Esperanto? Hardly. English? Hard to believe. English dominates now; it is the de facto international language among well-educated people. Broken English also abounds (and I do not think that native English speakers are very happy about the "international" version of their mother tongue). But the dominance of English is based on the economic power of the United States -- and that can change, both because of the growth of once underdeveloped economies and because of destructive processes within the United States itself (the anti-innovation policies of the big monopolies, the declining prestige of persistent work). And politicians are exactly the competitive kind of people who will not pass up a chance to dethrone English. Esperanto, for its part, is in a safer position.

Another question is whether the time of a single language will ever come. Great technological changes may even remove the need for it. Imagine, for example, that everyone speaks their own language and machines translate it for everyone else. Or that aliens arrive and change everything. Better not to fantasize, but to do our work now.

I want to quote some interesting notes from the Latin-language news site "Nuntii Latini" (based at Finnish radio). They once discussed multilingualism in the European Union and calculated that 506 pairwise translations are needed among all the official languages. And they noted (February 4, 2005):

"Quae cum ita sint, plerique magistratus in rebus Unionis agendis lingua Anglica, rarius Francica aut Germanica utuntur. Sed praesertim Franci superioritatem Anglicae linguae concedere nolunt. Sunt quidem qui censeant linguam Latinam in usum Unionis Europaeae resuscitandam esse, sed voluntas politica ad rem efficiendam deest."

(This being so, most officials use English in Union affairs, more rarely French or German. But the French in particular are unwilling to concede the superiority of English. There are indeed some who think that Latin should be revived for the use of the European Union, but the political will to bring this about is lacking.)

So what about Esperanto? Probably the main obstacle is the laziness of high officials who do not want to learn a new language (while forcing lower-ranking officials in Luxembourg to learn Luxembourgish).

To conclude, let me note one serious gap in our Esperanto activity: scientific and technical literature. What I have seen was mostly of extremely low quality, both linguistically and professionally; only once did I see a serious book in Esperanto, covering several fields and well written (unfortunately I no longer have it). The need for specialized terminologies is undoubtedly recognized, and people produce them by piling up terms, but the terms will not live without real use. A few years ago I searched in vain for an Esperanto word for what is called "stem cell" in English (in the end I wrote "tigoĉelo"); perhaps the term has been added somewhere by now. The reason is clear: the international scientific community for the most part uses English without any problem. Of course we do not now have the capacity to create a respectable "peer-reviewed" specialized scientific journal, but perhaps we could aim at a high-level scientific bulletin, not for colleagues within a single field but for specialists across fields, serious and without baby talk. Is that possible?

Tuesday, June 14, 2011

A TESTING CHALLENGE FOR DEDUPLICATION SOFTWARE

Deduplication is one of the relatively recent technologies used for data backup in order to save space and data communication bandwidth (see, for example, www.usenix.org/event/fast08/tech/full_papers/zhu/zhu_html). The idea is to split the file to be backed up into "chunks" at points defined by the contents (not by the offset from the start), so that if changes involve insertion or deletion of some portion of the file, the splitting points move with the content and the unaffected chunks are preserved. Then for each chunk we compute a hash value, e.g., the 160-bit SHA-1 checksum, which we store along with the chunk and use as a key for chunk search. If the same key value is found next time, we assume that the chunk is the same and do not transmit or store it again.
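
For concreteness, here is a minimal sketch (mine, not from the cited paper) of such a deduplicating store. The window size, boundary test, and minimum chunk size are arbitrary toy choices, and a real system would use a cheap rolling hash rather than recomputing SHA-1 at every position.

```python
import hashlib

WINDOW = 48                 # bytes of context that decide a boundary (toy choice)
CHUNK_MASK = (1 << 12) - 1  # boundaries on average every ~4 KiB
MIN_CHUNK = 512             # avoid degenerate tiny chunks

store = {}                  # SHA-1 digest -> chunk bytes (the deduplicated store)

def chunk_boundaries(data: bytes):
    """Yield (start, end) pairs whose cut points depend only on the last
    WINDOW bytes, so an insertion or deletion moves boundaries only locally."""
    start = 0
    for i in range(len(data)):
        if i + 1 - start < MIN_CHUNK:
            continue
        window = data[max(0, i - WINDOW + 1):i + 1]
        h = int.from_bytes(hashlib.sha1(window).digest()[:4], "big")
        if (h & CHUNK_MASK) == 0:
            yield (start, i + 1)
            start = i + 1
    if start < len(data):
        yield (start, len(data))

def backup(data: bytes):
    """Store unseen chunks under their 160-bit SHA-1 key; return the key list."""
    keys = []
    for lo, hi in chunk_boundaries(data):
        chunk = data[lo:hi]
        key = hashlib.sha1(chunk).digest()
        if key not in store:            # equal key is assumed to mean equal chunk
            store[key] = chunk
        keys.append(key)
    return keys

def restore(keys):
    """Reassemble a file from its list of chunk keys."""
    return b"".join(store[k] for k in keys)
```

Backing up a slightly edited copy of the same data then stores only the chunks around the edit, because the boundaries far from the change do not move.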

The risk is that the same hash code may come from a completely different chunk, but with a well-randomized 160-bit code the probability is much smaller than that of any other possible adversity. However, if it does happen, we will fail to save the new chunk, and on an attempt at recovery the old chunk will be returned instead of the right data.

So, what will happen if a backup provider chooses a smaller key size, e.g., 32 bits? The probability of hash collisions will be much higher, and they will probably show up as the number of stored chunks increases. Let us assume that we already have 2**16 (i.e., 65536) chunks. We want the keys to be all different. For the first key there are 2**32 possibilities out of 2**32; for the second key, to avoid a collision, there are just 2**32-1 out of 2**32; for the next key, 2**32-2 out of 2**32, etc. The probability of avoiding collisions with n keys is therefore the product of (1-i/2**32) with i ranging from 0 to n-1.

Let us use logarithms. We know that ln(1-i/2**32) is less than -i/2**32. The sum of the logarithms is therefore less than -n*(n-1)/2**33. With n equal to 2**16 this is about -1/2, giving a probability of no collision of at most exp(-1/2), i.e., about 0.61. This puts the probability of at least one collision at about 40% with as few as 64K chunks.
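
A few lines of Python (not in the original post; just a check of the arithmetic) reproduce both the exact product and the bound:

```python
import math

SPACE = 2 ** 32          # number of distinct 32-bit keys
n = 2 ** 16              # 64K chunks already stored

# Exact probability that n random 32-bit keys are all different:
# the product of (1 - i/2**32) for i = 0 .. n-1.
p_no_collision = 1.0
for i in range(n):
    p_no_collision *= 1.0 - i / SPACE

# Upper bound from ln(1 - x) < -x:  exp(-n*(n-1)/2**33).
bound = math.exp(-n * (n - 1) / (2 * SPACE))

print(f"exact P(no collision)     = {p_no_collision:.4f}")      # ~0.6065
print(f"bound  exp(-n(n-1)/2**33) = {bound:.4f}")               # ~0.6065
print(f"P(at least one collision) = {1 - p_no_collision:.4f}")  # ~0.39
```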

Can the error be detected? Not with certainty, because apart from having the collision and failing to store the right chunk, the customer will also need to request recovery of the file containing that chunk, and this doesn't happen very often (how often depends on whether the customer uses the backup for emergency recovery or just to put away files that are not immediately needed).

So let us assume that someone deploys a deduplication system with just 32-bit keys. The system will be tested, but most likely the number of files backed up and recovered during testing will amount to far fewer than 2**16 chunks. So the tests will succeed and the system may well be accepted. But then, in real use, the chunk count will quickly reach and exceed this value, eventually leading to a mass loss of stored data.

One more question is whether anyone would actually deploy a system with such short keys. But, believe it or not, I saw a company contemplating just that.

So the challenge is: what kind of testing can detect this type of fault?

Saturday, June 11, 2011

Hello

Hello! My name is Gregory Tseytin, and I am going to share my thoughts both about my profession (computer science/software engineering/computational linguistics) and about other matters of public interest. See you soon.