Archive for the ‘Science’ Category

50 Tons of Highly Enriched Uranium in 40 countries

August 4, 2008

According to presidential candidate Obama, 50 tons of loosely guarded highly enriched uranium (HEU) remains in 40 countries around the world, and he wants to negotiate agreements to eliminate these stocks within four years. A laudable goal, but how will we go about doing it? Will all these other countries willingly give up their HEU stocks? Will they decommission or convert their HEU-based operating reactors?

In this 1-page proposal, I explore how a single, fixed price for HEU might solve the problem once and for all.

“We have decided to be happy because it is good for our health.”

July 9, 2008

Yesterday for the first time I read this quote written on the bottom of a sign inside Boulange Bakery on Union St (SF) — it was right above the counter with sugar, jam, cream, nutella, etc. Although I had visited the same bakery and spread raspberry preserves on my croissant there numerous times before, I had not noticed the quote until now — pure genius. We often look externally for happiness — is it social, cultural, economic, or geographic? Or is it really just as simple as deciding to be happy?

Study: World Gets Happier

Denmark is the happiest nation and Zimbabwe the most glum, he found. (Zimbabwe’s longtime ruler Robert Mugabe was sworn in as president for a sixth term Sunday after a widely discredited runoff in which he was the only candidate. Observers said the runoff was marred by violence and intimidation.) The United States ranks 16th…. “The results clearly show that the happiest societies are those that allow people the freedom to choose how to live their lives,” Inglehart said.

The Keys to Happiness, and Why We Don’t Use Them

Happiness is 50 percent genetic, says University of Minnesota researcher David Lykken. What you do with the other half of the challenge depends largely on determination, psychologists agree. As Abraham Lincoln once said, “Most people are as happy as they make up their minds to be.”… One route to more happiness is called “flow,” an engrossing state that comes during creative or playful activity, psychologist Mihaly Csikszentmihalyi has found. Athletes, musicians, writers, gamers, and religious adherents know the feeling. It comes less from what you’re doing than from how you do it.

AVERAGE HAPPINESS IN 95 NATIONS 1995-2005
How much people enjoy their life-as-a-whole, on a scale of 0 to 10

Top (> 7.7): Denmark 8.2, Switzerland 8.1, Austria 8.0, Iceland 7.8, Finland 7.7
Middle range (± 6.0): Philippines 6.4, India 6.2, Iran 6.0, Poland 5.9, South Korea 5.8
Bottom (< 4): Armenia 3.7, Ukraine 3.6, Moldova 3.5, Zimbabwe 3.3, Tanzania 3.2

Phrases, natural language processing, and artificial intelligence

July 3, 2008

Michael Arrington’s long interview with Barney Pell and the Microsoft product manager for search, recorded after the Powerset acquisition was announced, gives hints as to which of Powerset’s capabilities MSFT search needs most vis-a-vis what Google has already incorporated into its search. A lot of it seems to fall into the “phrase” statistics category rather than the “linguistic” understanding category — see below, but also see this discussion about Powerset and other players like Hakia in semantic search. Given the performance of today’s mainstream search engines from Google/MSFT/Yahoo, I’m beginning to think that until the day we have artificial intelligence, phrases (bigrams, trigrams, etc.) and related statistical techniques will remain the silver bullet for general-purpose web search.
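To make the “phrase statistics” idea concrete, here is a minimal sketch (my own illustration, not anything from the interview) that counts bigrams and trigrams over a toy corpus and keeps the repeated ones as candidate index phrases:

```python
# Minimal sketch: collect bigram/trigram counts over a toy corpus and keep the
# repeated ones as candidate "phrases" for the index. The corpus and the count
# threshold are made up purely for illustration.
from collections import Counter

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

corpus = [
    "new york city apartment listings",
    "cheap apartment listings in new york",
    "new york city subway map",
]

counts = Counter()
for doc in corpus:
    tokens = doc.split()
    counts.update(ngrams(tokens, 2))   # bigrams
    counts.update(ngrams(tokens, 3))   # trigrams

# Any phrase seen more than once becomes an index term (e.g. "new york", "new york city").
phrases = [phrase for phrase, count in counts.items() if count > 1]
print(phrases)
```

Real engines do this at web scale with far more careful counting and pruning; the point is only that no linguistic understanding is involved.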

The Powerset/MSFT folks have a BHAG (big hairy audacious goal), and it’s hard to fault them for that. The technical challenge Powerset is running up against is the same sort of wall that shows up in modern computer vision and image recognition applied to things like image-search engines or face recognition in photo collections. The point here is that the technology works in well-defined problem domains or with sufficient training/hints, and it is massively parallelizable on today’s low-cost server farms for specific tasks — but it either fails miserably or degrades into unreliable performance when the software is not programmed to deal with increased scene or corpus complexity, i.e. the combinatorial explosion of inputs that arises with diverse features. This happens when the scene or corpus is noisy and when its features are ambiguous or new — exactly the situations where human beings still excel at making sense of the world through judgment, and why humans still write the software. For example, see these papers from Stanford computer science class projects (1 and 2) that show how unreliable the face-recognition step can be for auto-tagging Facebook photo collections, and in turn how training can be used to achieve “100% Accuracy in Automatic Face Recognition.”

The first of Powerset’s capabilities that MSFT wants is word-sense disambiguation — both inside user queries and in the corpus text — to produce more precise search results. The example he gives is “framed.” Two papers listed below from the 1990s describe two different approaches, using statistical collocations (Yarowsky) and latent semantic indexing (Schuetze)… this is something MSFT may not be doing yet, but Google probably already does, using Bayesian word clusters and other techniques.
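As a rough sketch of the collocation idea (in the spirit of Yarowsky’s approach, with seed collocations I invented purely for illustration), disambiguation can be as simple as scoring which sense’s collocates overlap the surrounding context:

```python
# Toy word-sense disambiguation for "framed" by collocation overlap, loosely in
# the spirit of Yarowsky-style collocation methods. The seed collocations are
# invented; a real system would learn them from a (bootstrapped) corpus.
SENSE_COLLOCATIONS = {
    "picture-sense": {"photo", "canvas", "wall", "hung", "wood"},
    "crime-sense": {"murder", "trial", "police", "innocent", "convicted"},
}

def disambiguate(context_words):
    context = set(context_words)
    scores = {sense: len(collocates & context)
              for sense, collocates in SENSE_COLLOCATIONS.items()}
    return max(scores, key=scores.get)

print(disambiguate("was he framed for a murder he did not commit".split()))  # crime-sense
print(disambiguate("the photo was framed and hung on the wall".split()))     # picture-sense
```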

RN: I think everything that Barney said was right on. I think you see search engines, including Live Search and also Google and Yahoo, starting to do more work on this matching of not exactly what the user entered, but it is usually limited to very simple things. So now all of us do some expansion of abbreviations or expansion of acronyms. If you type “NYC” in a search engine these days, in the last couple of years, it understands that it means the same thing as New York. These are very, very simple rules-based things, and no one understands that bark has one meaning if it is about a tree and a different meaning if it is about a dog. Or an example that someone gave the other day was the question of “was so-and-so framed.” And framed could mean a framed picture, or it could mean set up for criminal activity that did not occur, and so on. And you have to actually understand something: if it is a person’s name then it applies to one sense of the word framed; if it is not, then it doesn’t. So one of the things that Powerset brings that is unique, beyond just simple pluralization or adding an “-ing” to the user’s query, is that Powerset also looks at the document, it looks at the words that are on a web page, and this is actually very important. If you look at just the user’s query, what you have available to figure out what they are talking about are three words, four words, five words, maybe even less. That can give you certain hints. If you look at a web page that has hundreds or thousands of words on it, you have a lot more information you can use, if you understand it linguistically, to tell what it’s about, what kind of queries it should match and what kind of queries it shouldn’t match. And Powerset is fairly unique in applying this technology in the index on a fairly large scale already, and with Microsoft’s investment and long-term commitment we can scale this out even further and apply it to even more of the web, not just the Wikipedia content they have thus far.

Second is the potential to support complex but realistic queries in the future — they suggest that this could replace vertical search engines, which seems a bit far out.

RN: I think what Peter Norvig is saying has some degree of accuracy, and that he is also ignoring some things. So, just for normal queries, queries that are not phrased as questions, there is a lot of linguistic structure. If someone types in a query that is “2 bedroom apartments, under 1000 dollars, within a mile of Potrero Hill,” that query is loaded with linguistic content. And that’s a realistic query. That is the type of thing that customers actually want to find on the web. Today there is a sort of helplessness, where customers know that certain queries are too complicated, and they won’t even issue them to a search engine. They will go to some deep vertical search engine where they can enter different data into different boxes.

The “2 bedroom apartments, under 1000 dollars, within a mile of Potrero Hill” example is interesting, since it requires logic and can’t be solved just by using related keywords. Even with access to infinite compute cycles, I don’t expect any general-purpose search engine applied to the web corpus can do this query reliably today, including Powerset. For verticals like real estate or travel, we will still need things like Craigslist or Orbitz/Mobissimo that list classified ads and/or build on top of user submissions, metadata, and targeted crawling.
Compared to phrases, linguistic analysis helps provide a better answer to the question of whether a given web page matches the user’s intended query. It certainly has the potential to improve recall through more accurate query interpretation and retrieval of web pages from the search engine’s index. By itself, though, linguistic analysis does not improve the precision of search results (by re-ranking) in the way that hyperlink analysis did in Google’s PageRank and other web search engines.
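For contrast, hyperlink-based re-ranking of the PageRank variety can be sketched in a few lines of power iteration (a toy illustration with an invented link graph, not Google’s actual implementation):

```python
# Toy PageRank via power iteration over a tiny invented link graph, to show the
# hyperlink-based re-ranking contrasted with linguistic analysis above. The
# graph, damping factor, and iteration count are illustrative only.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}
damping = 0.85

for _ in range(50):
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outlinks in links.items():
        share = rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += damping * share
    rank = new_rank

# Page "c", with the most inbound links, ends up with the highest rank.
print(sorted(rank.items(), key=lambda kv: -kv[1]))
```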
The third use case is query expansion to related keywords using an automatically generated thesaurus — which I thought Google was already doing.

… that is not really an area that is that interesting. But some of these more complex queries really are. For example, shrub vs. tree. If I do a search for decorative shrubs for my yard, and the ideal web page has small decorative trees for my garden, it really should have matched that page and brought it up as a good result. But today Google won’t do it, Yahoo won’t do it, and Live won’t do it. So even in these normal queries there is a lot of value in the linguistics.

I didn’t understand this “decorative shrubs” example. When I try it out on Google, I find a result that seems relevant to me. Furthermore, using the “~” operator in front of any word, Google allows you to search using “related” keywords. If you try “decorative ~shrubs” you get results containing “decorative plants” as well.
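The behavior behind a related-keyword operator can be pictured as a simple thesaurus-driven query rewrite (the synonym table below is invented; real engines mine these relations from query logs and the corpus):

```python
# Toy thesaurus-driven query expansion, illustrating the "related keywords" idea
# behind the "~" operator. The synonym table is invented for illustration.
RELATED = {
    "shrubs": ["shrubs", "bushes", "plants", "trees"],
    "decorative": ["decorative", "ornamental"],
}

def expand(query):
    groups = []
    for word in query.split():
        alternatives = RELATED.get(word, [word])
        groups.append("(" + " OR ".join(alternatives) + ")")
    return " ".join(groups)

print(expand("decorative shrubs"))
# (decorative OR ornamental) (shrubs OR bushes OR plants OR trees)
```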

Fourth, it seems 5% of queries are linguistic queries today (versus phrase queries), and they expect that fraction to expand as MSFT’s search engine capabilities mature to incorporate linguistic understanding of the query — mostly related to facts or solving problems.

BP: Let me answer the question of whether anybody actually searches this way. The answer is yes, people do this. It isn’t the most common mode, but we do see that probably 5% of queries are natural language queries. These are not all queries that are phrased in complete sentences, but they are queries where the customer has issued something that has some sort of linguistic structure. Almost any query with a preposition: X and Y, A near B, attribute A of Y, etc. Those things are loaded with linguistic structure…. RN: I have a list of some natural language queries in front of me. Can we just show you some queries that our customers have actually sent to us? These are random examples. The first person to see the dark side of the moon. How to get a credit card in Malaysia. Enabling system restore in group policy on domain controller. Timeline of Nvidia. How to measure for draperies. What is the difference between Mrs. and women’s sizes? Does my baby have acid reflux? I could just go on and on. These fit in the category that we’ve labeled, which matches about five percent of queries, and they’re really just cases where the customer can’t think of a simpler way to express it.

Finally, the “people” use case (queries about a particular person) seems to be one of Powerset’s most visible strengths…

BP: We return answers. We actually synthesize, so if you were to say, “What did Tom Cruise star in,” you actually get not just the movies, but the cover art for the different movies. It synthesizes multiple pieces of information to give you a whole different kind of presentation. Or, if you were just to say, “Bill Gates,” you’d be given an automatically generated profile of Bill Gates, pulled across many, many articles. It’s no longer just about 10 links, although we can certainly do a more relevant job (and will) with the blue links, and a better job of presenting those links.

A friend of mine tried getting answers via Google, and got quick (< 5 minutes for all the searches) and relatively accurate answers without much hassle using phrases:

Those are examples that illustrate why phrases and statistical approaches are not only “good enough” for searching the web corpus, but also have a very high “signal-to-noise” ratio compared to all the other machine search techniques that have become available to us in the last 20 years.

The root issue seems to me to be that human symbolic and image comprehension is still well beyond today’s programmable machine capabilities — which has nothing to do with compute power or specific interface/modality (language or vision) — it has much more to do with human intelligence itself that resides inside the cerebral cortex and other parts of the central nervous system. I’d be willing to bet that to achieve their BHAG and make the real quantum leaps from today’s search technology that they advertise, Powerset + MSFT will need the equivalent of artificial intelligence, not just natural language processing (or better image processing in the case of scenes).

Powerset is not the only semantic search engine. Besides Hakia, there is Cognition Technologies — a company founded by a UCLA professor — which also has its own Wikipedia search at http://WIKIPEDIA.cognition.com. This white paper compares their results to Powerset, query by query, showing how Cognition’s precision is higher and most of the time its recall count is much lower (more relevant vs. unrelated hits) — of course this is a test designed by Cognition, and therefore a display of their own strengths. In one example they show Cognition returning exactly one result and Powerset returning 259. Also see Top-Down and Bottom-Up Semantics by Cognition’s CTO. Cognition’s other white paper compares them query by query to Google, Yahoo, Verity, and Autonomy — see section VI. In these tests, only Google shows an observable “statistical” boost, but it doesn’t quite work as well as Cognition — more precision and higher relevant recall. In summary, Cognition achieves these results using the semantic processing and natural language parsing technology described here:

Cognition’s Semantic NLP Understands:

  • Word stems – the roots of words;
  • Words/Phrases – with individual meanings of ambiguous words and phrases listed out;
  • The morphological properties of each word/phrase, e.g., what type of plural does it take, what type of past tense, how does it combine with affixes like “re” and “ation”;
  • How to disambiguate word senses – This allows Cognition’s technology to pick the correct word meaning of ambiguous words in context;
  • The synonym relations between word meanings;
  • The ontological relations between word meanings; one can think of this as a hierarchical grouping of meanings or a gigantic “family tree of English” with mothers, daughters, and cousins;
  • The syntactic and semantic properties of words. This is particularly useful with verbs, for example. Cognition encodes the types of objects different verb meanings can occur with.
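The “family tree of English” point is the easiest to picture in code. Here is a tiny, hand-built hypernym table (my own toy, not Cognition’s data; WordNet is the well-known real-world example of such a resource) with a lookup that walks up the hierarchy:

```python
# A tiny hand-built "family tree of English": each word points to its hypernym
# (its parent meaning). The hierarchy is invented and minuscule, purely to show
# the kind of ontological relation described above.
HYPERNYM = {
    "poodle": "dog",
    "dog": "canine",
    "canine": "mammal",
    "mammal": "animal",
    "oak": "tree",
    "tree": "plant",
}

def ancestors(word):
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain

print(ancestors("poodle"))  # ['dog', 'canine', 'mammal', 'animal']
```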

Also see Q-Go, a European NLP technology company for customer service applications that partners with MSFT,

The intelligent, linguistic technology-based software that Q-go develops makes sure that customers receive a direct, immediate answer to all of their questions from a company’s website. Online, in their own language, and with at least the same comprehensiveness and quality as the answers provided by call centres. Not only is this easier for organisations, it’s also faster and cheaper. Q-go’s headquarters are located in Amsterdam, and the company has four local offices in Barcelona, Madrid, Frankfurt and Zurich.

A post by Don Dodge (MSFT) reveals that Powerset’s “semantic rules” can be applied to MSFT’s existing index of the web, and are therefore likely to be as much about word co-occurrence statistics and clusters as about using linguistic logic to gain an understanding of the corpus. The example is also illustrative of how Powerset breaks down a natural language query, and the post goes on to explain how Powerset may also be useful in vertical search applications…

Powerset is using linguistics and (NLP) to better understand the meaning and context of search queries. But the real power of Powerset is applied to the search index, not the query. The index of billions of web pages is indexed in the traditional way. The big difference is in the post processing of the index. They analyze the indexed pages for “semantics”, context, meaning, similar words, and categories. They add all of this contextual meta data to the search index so that search queries can find better results.

Who is the best ballplayer of all time? Powerset breaks this query down very carefully using linguistic ontologies and all sorts of proprietary rules. For example, they know that “ballplayer” can mean Sports. Sports can be separated into categories that involve a “Ball”. Things like baseball, basketball, soccer, and football. Note that soccer does not include the word ball, yet Powerset knows this is a sport that includes a ball.

Powerset knows that “ballplayer” can mean an individual player of a sport that includes a ball. They know that “best of all time” means history, not time in the clock sense.
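A crude way to picture this kind of hand-curated ontology lookup (the category tables below are my own invention for illustration, not Powerset’s actual rules):

```python
# Crude sketch of a hand-curated ontology lookup: "ballplayer" expands to sports
# played with a ball (including soccer, even though the word "ball" never appears
# in it), and "of all time" maps to history rather than clock time.
BALL_SPORTS = {"baseball", "basketball", "soccer", "football"}
CATEGORY = {
    "ballplayer": BALL_SPORTS,
    "of all time": {"history"},   # "best ... of all time" is about history
}

def expand_query(query):
    expansions = set()
    for phrase, related in CATEGORY.items():
        if phrase in query:
            expansions |= related
    return expansions

print(expand_query("who is the best ballplayer of all time"))
# prints the four ball sports plus 'history'
```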

Finally, an alternate theory that says Powerset was acquired for contextual ad placement, not search.

For the foreseeable future, phrases and statistical approaches will probably continue to deliver the greatest signal-to-noise ratio for machine indexing of web content, in the absence of a breakthrough in artificial intelligence. The evidence is not 100% conclusive, but during the last several weeks I’ve accumulated research papers to support this hypothesis…

Also some papers on techniques/tools for extracting keyphrases and performance evaluations that I came across…

  • Google made n-gram data available for all 5-grams (or smaller) appearing more than 40 times on the web, “All Our N-gram are Belong to You”
  • OpenCalais by Reuters can analyze text for named entities, facts, and events via their API
  • The chapter “Natural Language Tools” in pages 149-171 in “Advanced Perl Programming” by Simon Cozens (O’Reilly) — you get a very “quick & dirty” introduction to a number of natural language processing concepts and ways to implement and play around with them. Although Perl has many natural language processing tools, the Cozens book cuts to the chase, explains which are the easiest tools to use, and shows you how to use them.
  • “Coherent Keyphrase Extraction via Web Mining” discusses four variants of the KEA key-phrase extraction algorithms – two baseline and two enhanced – and evaluates their performance on a set of computer science and physics research papers. The enhanced versions use statistical association based on web term/phrase frequencies to “weed out” key-phrases that are unlikely to be associated with the text, and they improve on the baseline by generating a higher percentage of key-phrases that match author-generated key-phrases.
  • “Narrative Text Classification for Automatic Key Phrase Extraction in Web Document Corpora” compares three key-phrase extraction algorithms (TF.IDF, KEA, and Keyterm) for summarizing web pages on websites; they find that Keyterm performs best in tests with users (a minimal TF.IDF sketch follows this list). KEA is a keyphrase extraction package, and another one is LingPipe.
  • Paper from MSFT Research on clustering algorithms to find user-verifiable “related” phrases within search results — could in theory be applied to any list of documents, “Learning to Cluster Web Search Results”
  • Autonomy’s Clustering for Enterprise.
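As promised above, here is a minimal TF.IDF scoring sketch, the baseline the KEA papers compare against, over a toy corpus (the documents and whitespace tokenization are deliberately simplistic):

```python
# Minimal TF.IDF scoring over a toy corpus. Real keyphrase extractors add
# stemming, stop lists, multi-word phrase candidates, and position features;
# this only shows the core term-frequency x inverse-document-frequency idea.
import math
from collections import Counter

docs = [
    "neutron bombardment produces radioactivity in uranium",
    "slow thermal neutrons increase radioactivity",
    "search engines index phrases and keywords",
]

def tf_idf(doc_index):
    tokens = docs[doc_index].split()
    term_freq = Counter(tokens)
    scores = {}
    for term, count in term_freq.items():
        doc_freq = sum(1 for d in docs if term in d.split())
        scores[term] = (count / len(tokens)) * math.log(len(docs) / doc_freq)
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Highest-scoring terms are the candidate keyphrases for the first document.
print(tf_idf(0)[:3])
```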

Human-Like Memory Capabilities

June 18, 2008

Human-Like Memory Capabilities by Scott Fahlman, June 17, 2008

My interpretation of what he is saying is that he is looking to build an artificial memory system that can

  1. build-up new complex concepts/facts from incoming knowledge/information
  2. cross-check any given input against known facts
  3. “route” to the relevant fact(s) in response to any new situation (I’ve always wondered if there is a connection to routing on a graph).

All of this would happen automatically and rapidly, in real time, by taking advantage of massive parallelism built up from millisecond circuits, just as the human brain does, without needing the GHz circuits of today’s microprocessors.
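A very loose sketch of those three points (facts stored as edges in a graph, new input cross-checked against what is already known, and retrieval as a kind of routing over the graph) might look like the toy below. This is entirely my own illustration, not Fahlman’s actual design, which is far richer:

```python
# Toy fact graph: add_fact cross-checks new input against known facts, and
# related() "routes" outward from a concept with a breadth-first walk.
from collections import defaultdict, deque

facts = defaultdict(set)          # subject -> set of (relation, object)

def add_fact(subject, relation, obj):
    if (relation, obj) in facts[subject]:
        return "already known"    # cross-check: duplicate or redundant input
    facts[subject].add((relation, obj))
    return "learned"

def related(subject, max_hops=2):
    seen, queue, reachable = set(), deque([(subject, 0)]), []
    while queue:
        node, hops = queue.popleft()
        if node in seen or hops > max_hops:
            continue
        seen.add(node)
        for relation, obj in facts[node]:
            reachable.append((node, relation, obj))
            queue.append((obj, hops + 1))
    return reachable

add_fact("canary", "is-a", "bird")
add_fact("bird", "can", "fly")
print(related("canary"))  # routes from canary to bird, then to fly
```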

A friend of mine asked me: but isn’t this exactly what Google is?

Maybe Google includes a subset of this list. It indexes incoming knowledge (facts) and makes it searchable in response to a human-defined query. Still, I see some differences, which I outline below… See a related blog post, “So What’s the Google End-Game?”, about Google and artificial intelligence that quotes the Atlantic Monthly article “Is Google Making Us Stupid?”

First is the ability to specify the query in real time, in real-life situations. Google or machines can’t do that; only humans can at this point. Second is low search efficiency relative to human memory. Although Google may be the most comprehensive and best search engine in the world today, it still requires a lot of human interpretation to use it and to refine queries through multiple searches based on the initial results returned — as an example, I’m picturing all the effort needed to search for scientific papers and content. Since we end up having to do many, many searches, the “search” efficiency is not very high compared to human thought, which appears to be near-instantaneous across our store of facts — and that with millisecond circuitry rather than GHz microprocessors.

Google search may be a machine, but at the heart of it all are associations and judgments originally created by humans, in at least two ways. PageRank uses the number and prominence of hyperlinks that point to pages as its metric (collaborative filtering) — the more the better. See “On the Origins of Google”

… the act of linking one page to another required conscious effort, which in turn was evidence of human judgment about the link’s destination.

Another area is Bayesian association of “related” keywords (e.g. “nuclear” is related to “radioactive”) based on mining human-generated content. See “Using large data sets”. These associations are input by humans on the web, and merely computed/indexed by Google. Like Google, people to some degree learn and form their own relevance judgments by communicating with each other, but I don’t think that explains 100% of how human memory works.
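One simple stand-in for this kind of association mining is pointwise mutual information over co-occurrence counts (a toy sketch with an invented corpus; the methods actually used are certainly more elaborate):

```python
# Toy "related keywords" via pointwise mutual information (PMI) over document
# co-occurrence counts. The corpus is invented purely for illustration.
import math
from collections import Counter
from itertools import combinations

docs = [
    "nuclear reactor releases radioactive material",
    "radioactive decay in nuclear fuel",
    "happiness survey in denmark",
]

word_counts = Counter()
pair_counts = Counter()
for doc in docs:
    words = set(doc.split())
    word_counts.update(words)
    pair_counts.update(frozenset(pair) for pair in combinations(sorted(words), 2))

def pmi(w1, w2):
    n = len(docs)
    joint = pair_counts[frozenset((w1, w2))] / n
    if joint == 0:
        return float("-inf")
    return math.log(joint / ((word_counts[w1] / n) * (word_counts[w2] / n)))

print(pmi("nuclear", "radioactive"))   # positive: the words co-occur
print(pmi("nuclear", "happiness"))     # -inf: never co-occur in this toy corpus
```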

There must be something else, based on personal human experience with the world — like the way babies learn by putting everything in their mouths — that bootstraps human memory into what it ends up becoming. Is it logic, association, or something else? I think that is what’s missing in today’s machine memories — Google included.

This sums it up… See page 149, “Advanced Perl Programming,” by Simon Cozens

“Sean Burke, author of Perl and LWP and a professional linguist, once described artificial intelligence as the study of programming situations where you either don’t know what you want or don’t know how to get it.”

Millennium Technology Prize

June 11, 2008

Millennium Technology Prize Awarded to Professor Robert Langer for Intelligent Drug Delivery

The YouTube video on this site interviewing Dr. Langer is cool… he shows how drug delivery via polymers is now leading to precise targeting of drugs down to the level of individual cells, and enabling drug release controlled by microprocessors embedded in the body.

In choosing his career in 1974, he blew off the oil companies, and he talks about how his first boss liked to hire unusual people. He found 200 ways that didn’t work for every one or two ways that did.

Fermi’s Nobel lecture (1938)

June 3, 2008

The simplicity of Fermi’s Nobel lecture (1938) is stunning — the implications of this work changed history forever. Other Nobel lectures I’ve read go on and on; this one is only 8 pages. Fermi also cites and gives credit to dozens of other researchers upon whose work his discoveries are based. He explains the discovery of radioactivity caused by neutron bombardment and the study of interactions of “thermal” neutrons with all the elements, including uranium and thorium.

p. 415,

The small dimensions, the perfect steadiness and the utmost simplicity are, however, sometimes very useful features of the radon + beryllium sources.


His experiments involve neutron sources, paraffin wax, and spinning wheels, not complicated particle accelerators or machinery. Anyone with a freshman-level chemistry/physics knowledge should be able to understand the lecture, but even that is not absolutely needed.

Why we love — the actual chemistry

January 14, 2008

There is a book on this called “Why We Love” by Helen Fisher. See “First flush of love not emotional”

It’s based on research/experiments with hundreds of subjects over several years involving psychological surveys, fMRI, blood tests, and the like. It also seems this applies to mammals of all sorts. Fun to read all the way through.

  1. Three stages (lust, love, attachment)
  2. Lust is characterized by sexual attraction to the opposite sex, and driven primarily by testosterone regardless of gender
  3. Love is defined by focus/obsession with one mate in particular, mediated by dopamine and norepinephrine. When the “mate” is not present, the rise in dopamine/norepinephrine causes a drop in serotonin, just like an addiction/depression.
  4. Attachment creates the sense of peace/solace/trust in a long-term relationship and occurs primarily due to oxytocin in females and vasopressin in males.

According to the book there are interactions between these three effects (for example attachment may suppress lust), with evolutionary explanations.

Neuroeconomic research shows that expressions of “trust” increase oxytocin in people (men & women), while expressions of mistrust create a rise in testosterone only in men (not women).

Update on 9/2/2008: “‘Bonding Gene’ Could Help Men Stay Married”

A study of Swedish twin brothers found that differences in a gene modulating the hormone vasopressin were strongly tied to how well each man fared in marriage.

“Our main finding was an association between a variant of the vasopressin receptor 1a gene and how strong bonds men reported they had to their partners,” said lead researcher Hasse Walum, of the department of medical epidemiology and biostatistics at the Karolinska Institute in Stockholm. “Men carrying this variant scored on average lower on a scale measuring the strength of the bond compared to men not carrying this variant.”

Women married to men carrying the “poorer bonding” form of the gene also reported “lower scores on levels of marital quality than women married to men not carrying this variant,” Walum noted.

Vasopressin activates the brain’s reward system, and “you could say that mating-induced vasopressin release motivates male voles to interact with females they have mated with,” Walum said. “This is not a sexual motivation, but rather a sort of prolonged social motivation.” In other words, the more vasopressin in the brain, the more male voles want to stick around and mingle with the female after copulation is through. This effect “is more pronounced in the monogamous voles,” Walum noted.

They found that men with a certain variant, known as an allele, of the vasopressin 1a gene, called 334, tended to score especially low on a standard psychological test called the Partner Bonding Scale. They were also less likely to be married than men carrying another form of the gene. And carrying two copies of the 334 allele doubled the odds that the men had undergone some sort of marital crisis (for example, the threat of divorce) over the past year.

Global Warming — numbers please!

December 27, 2007

A few weekends ago after getting into yet another discussion about global warming where everyone thinks someone else has the numbers, but insists that hybrids and CFLs are the way to go… I decided to put together numbers that would be useful to whip out over drinks/discussions – slides available here. They are drawn from the IPCC and a few other sources. Conclusions on slide 3.

The Amazon’s impact on climate change

December 7, 2007

The Amazon forest in Brazil plays a unique role in the Earth’s climate, says the article “WWF says warming puts Amazon at risk.”

“The importance of the Amazon forest for the globe’s climate cannot be underplayed,” said Daniel Nepstad, author of a new report by the World Wide Fund For Nature released at the U.N. climate change conference in Bali.

“It’s not only essential for cooling the world’s temperature, but also such a large source of fresh water that it may be enough to influence some of the great ocean currents, and on top of that, it’s a massive store of carbon.”

Sprawling over 1.6 million square miles, the Amazon covers nearly 60 percent of Brazil. Largely unexplored, it contains one-fifth of the world’s fresh water and about 30 percent of the world’s plant and animal species — many still undiscovered.

“Debugging the House”

December 2, 2007

“Debugging the House: From vacuums to towels, new products for the microbe-phobic” is a Wall Street Journal article about different technologies for fighting bacterial germs in your house. In particular, they discuss the pros/cons of:

  • Silver Ions
  • Copper Oxide
  • Steam
  • Triclosan
  • Ultraviolet Light

All that said, the article tries to balance explaining the threat with the paranoia

Many common organisms can be dangerous or even deadly: Some 3,000 germs are known to cause human illness, says John Sinnott, director of Infectious Disease and International Medicine at the University of South Florida and Tampa General Hospital.

Even so, trying to wipe out all of the bacteria in your house isn’t advisable, experts say. Science writer Jessica Snyder Sachs, author of “Good Germs, Bad Germs,” says 99.9% of all germs are harmless to humans, and some are even beneficial. “Our bodies are covered with microbes, and many protect us against the bad guys,” she says.