Archive for the ‘Technology’ Category

“It is possible, in life, for everyone in the world to be wrong.”

October 28, 2008

…is what Nicholas Negroponte used to tell his son. On October 13, I got to hear Nicholas speak to the MIT Club of Northern California about his latest project, One Laptop Per Child (OLPC). As director of the MIT Media Lab, Nicholas had founded for-profit companies such as Wired Magazine, but this time he went against everyone’s advice and made OLPC a non-profit so that it could achieve a clarity of purpose enabling partnerships with organizations and governments around the world that a for-profit could not:

To create educational opportunities for the world’s poorest children by providing each child with a rugged, low-cost, low-power, connected laptop with content and software designed for collaborative, joyful, self-empowered learning.

By doing so, he found that they could attract the best industry talent — people OLPC could never afford to pay — and the organization runs on just 23 full-time-equivalents (FTEs). OLPC is launching these laptops to children in 31 countries around the world (like Rwanda, Mongolia, Haiti, Peru, Uruguay, India, …) with software and lessons in local languages. Production is at 100K laptops per month, and their eventual target is to reach the 500 million children worldwide who lack access to computers.

These computers have a mesh Wi-Fi network interface that connects to a satellite hookup to the Internet where available — so in addition to Internet access, the laptops can automatically check in to a central server in each country to track their status. If my memory serves me correctly, he reported that in some countries like Peru and Rwanda 100% of the laptops check in, while in others a large fraction of the laptops have never been seen.

The principles driving their design and deployment of these laptops are:

  1. child ownership — low cost laptops designed with a child in mind
  2. low ages (6-14) — view children as a market
  3. saturation — access for everyone not just a few computers in each school
  4. connection — to the Internet
  5. free and open-source software

The idea started when Negroponte and his colleagues at MIT recognized that computer programming is the closest we get to thinking about thinking, and that when computers are put in the hands of children in developing countries they can learn something not possible by any other means. It brings the learning experience alive for children and teachers even in traditional classrooms.

Going forward, scaling is their biggest organizational challenge. Technically, they need to increase the level of integration of the hundreds of components in the laptop to bring down cost. They do not see themselves as the primary provider of content (lessons, applications) — the analogy he used was that Gutenberg didn’t write the books. He says they don’t compete with Microsoft/Intel in the same way that the UN World Food Program doesn’t compete with McDonald’s.

This holiday season OLPC is introducing a Give-One-Get-One program that lets people buy an OLPC laptop and, in doing so, pay for (donate) a laptop for a child in a developing country. For their eventual target of 500 million children, they need $50 billion (500 million × $100 per laptop) to saturate the world with laptops. I think coming up with that amount of cash will eventually become the limit to OLPC’s scaling unless they find some other source of funding through governments or private foundations.

Nicholas started off his talk by telling a story, which he said had never been told before, about how he got funding to start the MIT Media Lab. The MIT president at the time was nearing the end of his term and decided he wanted to do research instead of moving up to Chairman of the MIT board. Nicholas decided this was an opportunity to start a lab, but he needed funding — where to start? He called up the chairman of NEC, saying that others in France would invest provided the chairman did. Once the chairman decided to invest, he played the same trick on the other investors, and he ended up with the funding he needed.

Like Nicholas, many great people I have admired have at one time or another said something like what Feynman did — “What do you care what other people think?” Investors like Buffett and Soros, and Trump, say something similar about being contrarian. Here is a link to a video from 1984 where Negroponte describes his visions for the future.

Update on 5/2: see “Einstein the nobody” and “The World As I See It”.

High-quality personal filtering

August 25, 2008

Imagine a personal filter that connects people to web content and to other people based on interests and memories they uniquely have in common – enabling discovery of new knowledge and personal connections. High-quality matches offer freedom from the chore of reading every item, so you don’t worry about missing something you care about in your email, on the web, or in a social network.

Background

There are 1.3 billion people on the web and over 100 million active websites. The Internet’s universe of information and people, both published and addressed to the user, is growing every day. Published content includes web pages, news sources, RSS feeds, social networking profiles, blog postings, job sites, classified ads, and other user-generated content like reviews. Email (both legitimate and spam), text messages, newspapers, subscriptions, etc. are addressed directly to the user.

The growth of Internet users and competition among publishers is leading to a backlog of hundreds or thousands of unread email, RSS, and web items in users’ inboxes and readers — forcing users to settle somewhere between the extremes of reading all the items or starting fresh (as in “email bankruptcy”).

Search is not enough.  If you don’t have time to read it, a personal filter reads it for you.  If search is like a fishing line useful for finding what you want right now, a personal filter is a fishing net to help you capture Internet content tailored to your persistent interests — for which it is either painful or inefficient to repeatedly search and get high quality results on an hourly or daily basis.

Techniques

Personal filtering can help users recover the peace of mind that they won’t miss an important item in the email inbox, on the web, or in a social network — provided matching algorithms deliver a sufficient signal-to-noise ratio (SNR). For example, an alert for “Devabhaktuni Srikrishna” produces relevant results today because it is specific – i.e. SNR is high. Only about 1 in 100 alerts for “natural language processing” or “nuclear terrorism” end up being relevant to me: signal is low, noise is high. So SNR is low, but it can be improved through knowledge of personal “context” gained via emails, similar articles, etc. Successful, relevant matches are all about context – what uniquely describes a user, and can we find it in other news or other users?

  1. The relevance of a match should be easily/instantly verifiable by the user.
  2. Statistical analysis of phrases has proven to be a useful method to extract meaningful signal from the noise and find precise matches.
  3. Phrases describing each person are in what users write, read, and specify.

Framework

Let’s define the user corpus to be all content describing a user, including the user’s own email, blogs, bookmarks, and social profiles. Interests can also be derived from items or links the user decides to upload or submit specifically by email – “like this.” As new information arrives, the most likely and statistically useful interests are automatically detected through natural language processing algorithms (aka text mining), similar to Amazon’s Statistically Improbable Phrases (SIPs). The interests are displayed on a website analogous to friend lists in social networking, the interest list is automatically ranked, and the user vets the list to keep only the most meaningful interests.
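As a rough illustration, here is a minimal Python sketch of SIP-style interest detection: rank a user’s phrases by how much more frequent they are in the user corpus than in a generic background corpus. The scoring rule is my own stand-in, not Amazon’s algorithm.

```python
# A minimal sketch of SIP-style interest detection (my own stand-in, not Amazon's
# algorithm): score each candidate phrase by how much more frequent it is in the
# user's own corpus than in a generic background corpus.
import re
from collections import Counter

def ngrams(text, n=2):
    words = re.findall(r"[a-z']+", text.lower())
    return zip(*(words[i:] for i in range(n)))

def improbable_phrases(user_docs, background_docs, n=2, top_k=20):
    user = Counter(ngrams(" ".join(user_docs), n))
    background = Counter(ngrams(" ".join(background_docs), n))
    user_total = sum(user.values()) or 1
    background_total = sum(background.values()) or 1
    def score(phrase):
        p_user = user[phrase] / user_total
        p_background = (background[phrase] + 1) / background_total  # +1 smoothing
        return p_user / p_background  # relative "improbability" of the phrase
    ranked = sorted(user, key=score, reverse=True)
    return [" ".join(p) for p in ranked[:top_k]]

# Usage: interests = improbable_phrases(my_emails, sample_web_pages)
# The ranked list is what the user would then vet, as described above.
```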

 

Once a user’s interest list is generated, it can be used to filter all information in the entire web corpus via open search-engine APIs (such as Yahoo BOSS and the Live Search API 2.0) to create a personal search index. The filter may also be aimed at a subset of the web, which we call the user-targeted corpus, configured directly by the user – including incoming email, alerts, RSS feeds, a specific news source (like the Wall Street Journal), social networking profiles, or any other information source chosen by the user. There are now open-standard APIs for email (like IMAP for Gmail and Yahoo Mail), social networks (like the Facebook API and OpenSocial for LinkedIn, Hi5, MySpace, and other social networking sites), and newspapers like the New York Times.
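And here is a minimal sketch of applying a vetted interest list to a user-targeted corpus, in this case an RSS feed pulled with the feedparser library; the keyword-count threshold is an arbitrary placeholder, not the scheme described in this post.

```python
# A minimal sketch of filtering a user-targeted corpus (here, RSS feeds) against a
# vetted interest list. feedparser is a real library; the keyword-count scoring rule
# below is my own illustration.
import feedparser

def filter_feed(feed_url, interests, threshold=1):
    matches = []
    for entry in feedparser.parse(feed_url).entries:
        text = (entry.get("title", "") + " " + entry.get("summary", "")).lower()
        hits = [phrase for phrase in interests if phrase.lower() in text]
        if len(hits) >= threshold:
            matches.append((entry.get("title", ""), entry.get("link", ""), hits))
    return matches

# Usage (hypothetical feed URL):
# filter_feed("http://example.com/rss", ["nuclear terrorism", "mesh networking"])
# The daily-update email would simply be these matches, ranked and batched.
```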

 

The result: A manageable number of high-quality filtered matches, ranked and displayed for the user, both viewable on the website and in the form of a periodic email (daily update).

 

The Personal Filtering Space

References: statistical analysis of phrases

Personalized news/RSS filtering

“Recommended News” in Google News, Yahoo’s personalized news portal, My Hakia, Technorati, Newsfire, Digg, Socialmedian, Alerts.com, Dailyme, Loud3r, Feedhub, the personal portals from WSJ/NYTimes/WashPost, NewsGator/RivalMap, AideRSS, DayLife, Reddit, Feeds 2.0, Twine, Feedrinse, Particls, Filtrbox, Meehive, Pressflip, Factiva by Dow Jones, HiveFire (see blog post and demo), Sprout, my6sense (see video and review), iDiscover (iPhone app), Skygrid for mining financial information in blogs (see article), UpToDate and JournalWatch for tracking medical research by specialty, and Anchora.

Content-to-content matching

Sphere, Angstro (mining the web for mentions of your friends using their social profiles), Pique by Aggregate Knowledge, Loomia, MashLogic (see TechCrunch writeup), and Zemanta search for and display “related” content on blog posts and news articles currently browsed — analogous to how Google AdSense displays contextually relevant ads, except these services serve up blog/news/other content. Inside Gmail, Google displays related content links below the ads under “More about…”

Person-to-person matching

EHarmony, Twine, Redux, Searchles, PeopleJar, and several Facebook apps like FriendAnalyzer have tried to do things like this.

Persistent search

Factiva Alerts by Dow Jones, Google Alerts, Filtrbox (see Mashable interview), Yotify (see Mashable writeup), CleanOffer for real estate listings, Everyblock for news stories about your own block through RSS or email, Trovix for searching jobs on Monster.com, Oodle for searching multiple classified sites, and Pressflip for news sources like Reuters, UPI, etc. Operating systems and email clients have had a “virtual folder” feature for some time – either configured by the user (manual) or auto-generated whenever a user performs a search (over-active). There is also a way to do persistent search in Gmail.

Some examples of academic/research projects

Creating hierarchical user profiles using Wikipedia (HP Labs), Interest-Based Personalized Search (Utah), CMU’s WebMate, UIUC’s “An Intelligent Adaptive News Filtering System”, “Hermes: Intelligent Multilingual News Filtering based on Language Engineering for Advanced User Profiling” (2002), and Fido from Norway, “A Personalized RSS News Filtering Agent”. HP researchers compared three ways of detecting tags in “Adaptive User Profiles for Enterprise Information Access” using del.icio.us bookmark collections. “Writing a Personal Link Recommendation Engine” in Python Magazine demonstrates a web-document ranking system based on content in del.icio.us bookmarks.

Natural Language Processing & Search

Cognition, Hakia, and Powerset (acquired by Microsoft). Google personalization incorporates Personalized PageRank and uses Bayesian Word Clustering to find related keywords — see “Introduction To Google Ranking” and “Technologies Behind Google Ranking”.

Collaborative Filtering

Digg, Socialmedian, StumbleUpon, FastForward by BuzzBox, Amazon.com (for books), “Recommended News” in Google News, Twine, Filtrbox.

Personal email analysis and indexing

Xobni, Xoopit, ClearContext, and Email Prioritizer by Microsoft Research (see writeups here and here).

Personal Music Filtering

Pandora

Recommender Systems

See the list of recommender systems in Wikipedia.

Related Posts

Bill Burnham writes,

…no one has yet to put together an end-to-end Persistent Search offering that enables consumer-friendly, comprehensive, real-time, automatic updates across multiple distribution channels at a viable cost.

Personalized Clustering: It’s too hard, say developers

Because let’s face it, Personalization + Clustering is the next big step in RSS. If 2005 was about Aggregation, then 2006 is all about Filtering.
Nik wrote up his thoughts today, in a post entitled Memetracking Attempts at Old Issues. While he mentions lack of link data as being an issue, it seems to me the crux of the problem is this:

“generating a personal view of the web for each and every person is computationally expensive and thus does not scale, at all.”

He goes on to say that “this is why you don’t have personalized Google results – we just don’t have the CPU cycles to care about you.”

So it’s mainly a computational and scaling problem. Damn hardware.

I think the enabling technology will be new and better algorithms. Also see the links in Filtering Services and “Why Filtering is the Next Step for Social Media”.

Web 3.0 Will Be About Reducing the Noise

…if you think it is hard enough to keep up with e-mails and instant messages, keeping up with the Web (even your little slice of it) is much worse…. I need less data, not more data.

Bringing all of this Web messaging and activity together in one place doesn’t really help. It reminds me of a comment ThisNext CEO Gordon Gould made to me earlier this week when he predicted that Web 3.0 will be about reducing the noise. (Some say it will be about the semantic Web, but those two ideas are not mutually exclusive). I hope Gould is right, because what we really need are better filters.

I need to know what is important, and I don’t have time to sift through thousands of Tweets and Friendfeed messages and blog posts and emails and IMs a day to find the five things that I really need to know. People like Mike and Robert can do that, but they are weird, and even they have their limits. So where is the startup that is going to be my information filter? I am aware of a few companies working on this problem, but I have yet to see one that has solved it in a compelling way. Can someone please do this for me? Please? I need help. We all do.”

The Music Genome Project powers Pandora

The Music Genome Project, created in January 2000, is an effort founded by Will Glaser, Jon Kraft, and Tim Westergren to “capture the essence of music at the fundamental level” using over 400 attributes to describe songs and a complex mathematical algorithm to organize them. The company Savage Beast Technologies was formed to run the project.

A given song is represented by a vector containing approximately 150 genes. Each gene corresponds to a characteristic of the music, for example, gender of lead vocalist, level of distortion on the electric guitar, type of background vocals, etc. Rock and pop songs have 150 genes, rap songs have 350, and jazz songs have approximately 400. Other genres of music, such as world and classical, have 300-500 genes. The system depends on a sufficient number of genes to render useful results. Each gene is assigned a number between 1 and 5, and fractional values are allowed but are limited to half integers.[1] (The term genome is borrowed from genetics.)

Given the vector of one or more songs, a list of other similar songs is constructed using a distance function.
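A minimal sketch of what such a distance function could look like, with made-up gene names and values (the Music Genome Project’s actual genes and weighting are proprietary):

```python
# A toy distance function over gene vectors, with invented gene names and values;
# the Music Genome Project's real genes and weighting are proprietary.
import math

def gene_distance(song_a, song_b):
    """Euclidean distance over the genes the two songs have in common."""
    shared = set(song_a) & set(song_b)
    if not shared:
        return float("inf")
    return math.sqrt(sum((song_a[g] - song_b[g]) ** 2 for g in shared))

# Genes are scored 1 to 5 in half-integer steps, per the description above.
seed = {"vocal_grit": 4.5, "guitar_distortion": 3.0, "tempo_feel": 2.5}
catalog = {
    "song_x": {"vocal_grit": 4.0, "guitar_distortion": 3.5, "tempo_feel": 2.0},
    "song_y": {"vocal_grit": 1.5, "guitar_distortion": 1.0, "tempo_feel": 5.0},
}
playlist = sorted(catalog, key=lambda s: gene_distance(seed, catalog[s]))
print(playlist)  # songs most similar to the seed song come first
```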

To create a song’s genome, it is analyzed by a musician in a process that takes 20 to 30 minutes per song. Ten percent of songs are analyzed by more than one technician to ensure conformity with the standards, i.e., reliability.

The technology is currently used by Pandora to play music for Internet users based on their preferences. (Because of licensing restrictions, Pandora is available only to users whose location is reported to be in the USA by Pandora’s geolocation software).[2]

How to Extract a Webpage’s Main Article Content

I had an idea to make a personalized news feed reader. Basically, I’d register a bunch of feeds with the application, and rate a few stories as either “good” or “bad”. The application would then use my ratings and the article text to generate a statistical model, apply that model to future articles, and only recommend those it predicted I would rate as “good”. It sounded like a plausible idea. I decided to start a pet project.

I soon learned that this idea wasn’t original, and in fact had been attempted by quite a few companies. The first to seriously implement this idea was Findory, later followed by Thoof, Tiinker, Persai, and probably others I’m not aware of. As of this writing, only Persai is still in business. Apparently, personalized news feeds aren’t terribly profitable. Why they’re not a commercial hit is a whole article in itself, so I won’t go into it now. However, before I admitted to myself that this project was doomed to failure, I decided to implement a few components to get a better feel for how the system would work. This is a review of a few interesting things I learned along the way.

TechCrunch writes about PeopleJar

The site’s most powerful feature is its robust search function, which allows users to search for others using many criteria. After creating a search, users can choose to have the site persistently monitor for any matches in the future.

Robert Pasarella writes about Tools for the Equity Research Toolbox

One of the paramount abilities of a good analyst is to spot trends early and realize their potential impact on a company or industry. What analysts are usually searching for is any hint of weakness or strength in competitive advantage. Sometimes the smallest trends start in the local newspapers. Google News makes locating those topics and stories much easier.

If you pair Google News with the enhanced filtering ability of Yahoo! Pipes and your favorite feed reader, you can create some worthwhile tools that help your trend-seeking abilities.

Here is an example I’ve been working on as part of a wider range of investment ideas on Oil.

My first approach was to set up a search in Google News that highlighted any time “oil” was in the title of a story. You can do that with the ‘allintitle’ operator, and since I wanted US-based sources I added the ‘location’ operator with USA as the source. It looks like this in the Google search window:
http://news.google.com/news?hl=en&ned=us&q=allintitle:oil+location:USA&ie=UTF-8&scoring=n

To see more useful operators check out the Google Cheat Sheet

There are choices on the page to make this search into an RSS feed. Clicking a link on the page will create a feed URL in either RSS 2.0 or Atom. You can then take that feed and do further refining in Yahoo! Pipes. I like to create a broad search from Google News and then apply a layer of filters in Pipes for key terms that I think are important. Once I have configured Pipes to my liking, it becomes a feed for my RSS reader. I also created a pipe that looks at the opinion and editorial feeds from certain newspapers. Those in the analyst community will recognize this technique as akin to using Google Alerts. Using RSS is the better mousetrap, and it doesn’t clog your mailbox.
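For those who prefer code to Pipes, here is a rough Python stand-in for that filtering layer, assuming the Google News query above can be fetched as a feed; the output=rss parameter and the key terms are my own assumptions.

```python
# A rough Python stand-in for the Yahoo! Pipes filtering layer described above: pull
# the Google News "allintitle:oil location:USA" results as a feed and keep only items
# mentioning key terms. The output=rss parameter and the key terms are assumptions;
# the exact feed URL Google exposes may differ from the search URL quoted above.
import feedparser

FEED_URL = ("http://news.google.com/news?hl=en&ned=us"
            "&q=allintitle:oil+location:USA&ie=UTF-8&scoring=n&output=rss")
KEY_TERMS = ["refinery", "opec", "offshore"]  # example filter terms (my own picks)

def filtered_oil_stories():
    for entry in feedparser.parse(FEED_URL).entries:
        text = (entry.get("title", "") + " " + entry.get("summary", "")).lower()
        if any(term in text for term in KEY_TERMS):
            yield entry.get("title", ""), entry.get("link", "")

for title, link in filtered_oil_stories():
    print(title, link)
```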

Cull Web Content With Alerts, by KATHERINE BOEHRET, Wall Street Journal

For years, I’ve used Google Alerts as a way of keeping track of myself online. If my name is mentioned in a blog or if this column appears on the Web, such as on the site of a newspaper that syndicates it, a Google Alert sends me an email about it. Google Alerts can work for you to find a variety of things, such as telling you if a video of a favorite band popped up online or that a blogger posted something about last night’s episode of “Mad Men.”

In about a month, Google will begin delivering these alerts to users via feeds, as well as emails. Google certainly isn’t alone in the alerts arena, as Yahoo, Microsoft and AOL are also players. This week I tried two small companies that recently joined the mission to help users find Web content using alerts.

I tried Alerts.com and Yotify.com, and found worthwhile features in both. While Google Alerts does a good job of finding search terms in news, blogs and videos, Alerts.com and Yotify use forms that are a cinch to fill out and let you pinpoint your searches.

Web 3.0: Not Yet by Navdeep Manaktala from New Delhi,

Major aspects of Web 2.0–vertical search, commerce, community, professional and user-generated content–have been panning out nicely, albeit slowly. But personalization, my cornerstone concept for Web 3.0, languishes.

I do believe, however, that we are going to move toward a more personalized and satisfying user experience within the next decade. After all, we went from Web 1.0 in 1995 to Web 2.0 in 2005. Only three years into Web 2.0, perhaps it is natural that we stay here and fine-tune for another five to seven years, before the real breakthrough innovations can come about and usher in Web 3.0.

Struggling to Evade the E-Mail Tsunami by RANDALL STROSS of the New York Times,

E-MAIL has become the bane of some people’s professional lives. Michael Arrington, the founder of TechCrunch, a blog covering new Internet companies, last month stared balefully at his inbox, with 2,433 unread e-mail messages, not counting 721 messages awaiting his attention in Facebook. Mr. Arrington might be tempted to purge his inbox and start afresh – the phrase “e-mail bankruptcy” has been with us since at least 2002. But he declares e-mail bankruptcy regularly, to no avail. New messages swiftly replace those that are deleted unread.
When Mr. Arrington wrote a post about the persistent problem of e-mail overload and the opportunity for an entrepreneur to devise a solution, almost 200 comments were posted within two days. Some start-up companies were mentioned favorably, like ClearContext (sorts Outlook inbox messages by imputed importance), Xobni (offers a full communications history within Outlook for every sender, as well as very fast searching), Boxbe (restricts incoming e-mail if the sender is not known), and RapidReader (displays e-mail messages, a single word at a time, for accelerated reading speeds that can reach up to 950 words a minute). But none of these services really eliminates the problem of e-mail overload because none helps us prepare replies. And a recurring theme in many comments was that Mr. Arrington was blind to the simplest solution: a secretary.

Sarah Perez of ReadWrite Web proposes five solutions to email overload,

  • get it done
  • 4-hour work week
  • email as sms
  • folders/rules
  • email bankruptcy

In Email Hell, Ross Mayfield writes in Forbes

E-mail overload is the leading cause of preventable productivity loss in organizations today. Basex Research recently estimated that businesses lose $650 billion annually in productivity due to unnecessary e-mail interruptions. And the average number of corporate e-mails sent and received per person per day are expected to reach over 228 by 2010.

Clint Boulton writes in “Study: Collaboration Overload Costs U.S. $588B a Year “

“Information Overload: We Have Met the Enemy and He is Us,” authored by Basex analysts Jonathan B. Spira and David M. Goldes and released Dec. 19, claims that interruptions from phone calls, e-mails and instant messages eat up 28 percent of a knowledge worker’s work day, resulting in 28 billion hours of lost productivity a year. The $588 billion figure assumes a salary of $21 per hour for knowledge workers.
The addition of new collaboration layers force the technologies into untenable competitive positions, with phone calls, e-mails, instant messaging and blog-reading all vying for workers’ time.
For example, a user who has started relying on instant messaging to communicate may not comb through his or her e-mail with the same diligence. Or, a workgroup may add a wiki to communicate with coworkers, adding another layer of collaboration and therefore another interruption source that takes users away from their primary tasks.
Beyond the interruptions and competitive pressure, the different modes of collaboration have created more locations through which people can store data. This makes it harder for users to find information, prompting users to “reinvent the wheel because information cannot be found,” Basex said.
Basex’ conclusion is that the more information we have, the more we generate, making it harder to manage.

In “The threat from within,” COL. PETER R. MARKSTEINER writes,

If a technological or biological weapon were devised that could render tens of thousands of Defense Department knowledge workers incapable of focusing their attention on cognitive tasks for more than 10 minutes at a time, joint military doctrine would clearly define the weapon as a threat to national security.
Indeed, according to the principles of network attack under Joint Publication 3-13, “Information Operations (IO),” anything that degrades or denies information or the way information is processed and acted upon constitutes an IO threat. That same publication cautions military leaders to be ever-vigilant in protecting against evolving technologically based threats. Yet throughout the Defense Department and the federal government, the inefficient and undisciplined use of technology by the very people technology was supposed to benefit is degrading the quality of decision-making and hobbling the cognitive dimension of the information environment.
We all receive too much e-mail. According to the Radacati Research Group, roughly 541 million knowledge workers worldwide rely on e-mail to conduct business, with corporate users sending and receiving an average of 133 messages per day – and rising. While no open-source studies address how the Defense Department’s e-mail volume compares to corporate users’, my own anecdotal experience and that of legions of colleagues suggests a striking similarity. Without fail, they report struggling every day to keep up with an e-mail inbox bloated with either poorly organized slivers of useful data points that must be sifted like needles from stacks of nonvalue-adding informational hay or messages that are completely unrelated to any mission-furthering purpose.
E-mail is a poor tool for communicating complex ideas. Text-only communication, or “lean media,” as it is referred to by researchers who study the comparatively new field of computer mediated communication, lacks the nonverbal cues, such as facial expression, body language, vocal tone and tempo, that inform richer means of communication. Moreover, aside from its qualitative shortcomings and viral-like reproductive capacity, a growing body of research suggests e-mail’s interruptive nature is perhaps the most pressing threat to decision-making in the cognitive dimension.

In “The Future of Search Won’t Be Incremental,” Adam DuVander writes,

Personalization isn’t only coming, it’s here. Sign in to your Google account and you can activate it. Prepare to be underwhelmed. But even if it were as Carrasco describes, privacy concerns would stop personalized search from being adopted until the benefits were undeniable. It would take a radical shift.

When Google came along, it provided something that had never been seen before: good search results. Unlike all the other search engines, Google’s top few slots had what we were looking for. And it provided them fast.

It was a much easier time to make big changes. Someone has to make us realize that Google’s results are as antiquated as Yahoo and Excite were in the late 90s. A change in interface might be the most likely innovation.

Sphere, which was acquired by AOL News, displays articles “related” to the content of the page currently viewed by the user, and now powers over 100,000 sites including many major news outlets like the Wall Street Journal, Time, Reuters, etc. It is also the back-end service (see content widget) used to generate “possibly related posts” on WordPress.

Sphere’s founder explains why they created it,

“We founded Sphere with a mission to make contextually relevant connections between all forms of content (mainstream media articles, archived articles, videos, blogs, photos, ads) that enable the reader to go deep on topics of interest,” wrote Conrad.

At the time of its acquisition, Sphere reached a large number of webpages:

Sphere’s third-party network includes more than 50,000 content publishers and blogs and is live on an average of more than 2 billion article pages across the web every month.*

Om Malik writes about Sphere’s original concept for blog search,

The way Sphere works is a combination of many tracks. Lets use an example say of what else, Broadband. The look for blogs that write about broadband, (including those with broadband in the title of the blog) to create a short list. If I am linking to someone who is also a broadband blogger, and vice-versa, Sphere puts a lot of value on that relationship. The fact is most of us broadband bloggers tend to debate with each others. Think Blog Rank, Instead of Google’s Page Rank. The company has also taken a few steps to out-smart the spammers, and tend to push what seems like spam-blog way down the page. Not censuring but bringing up relevant content first. They have pronoun checker. Too many I’s could mean a personal blog, with less focused information. That has an impact on how the results show up on the page.

John Battelle on how Sphere works,

It pays attention to the ecology of relationships between blogs, for example, and it gives a higher weighted value to links that have more authority. This will insure, for example, that when a Searchblog author goes off topic and rants about, say, Jet Blue, that that author’s rant will probably not rank as high for “Jet Blue” as would a reputable blogger who regularly writes about travel, even if that Searchblog author has a lot of high-PageRank links into his site. Sphere also looks at metadata about a blog to inform its ranking – how often does the author post, how long are the posts, how many links on average does a post get? Sphere surfaces this information in its UI, I have to say, it was something to see that each Searchblog post gets an average of 21 links to it. Cool!

Melodee Patterson writes,

Last week I was on vacation without Internet access. Now that REALLY slowed down my infomaniac impulses! I had to settle for flipping through the stack of magazines that I had brought with me, and reading the ebooks that I had downloaded previously and stored on my hard drive. (Thank God I had thought ahead!)

While on vacation, I did have time to think about how much time I did waste on aimless browsing and unfocused research. So I created a list of changes that I’m going to implement this week. That includes:

  • removing any Google alerts that have not provided themselves useful over the past 6 months
  • deleting any RSS feeds that have not added to my knowledge or imagination
  • hitting “unsubscribe” to ezines that I don’t really read
  • using a timer to keep my Internet rovings to 15 minutes (unfortunately, it has a 7 minute snooze button)

50 Tons of Highly Enriched Uranium in 40 countries

August 4, 2008

According to presidential candidate Obama, 50 tons of loosely guarded highly enriched uranium (HEU) remains in 40 countries around the world, and he wants to negotiate agreements to eliminate it in four years. A laudable goal, but how will we go about doing this? Will all these other countries willingly give up their HEU stocks? Will they decommission or convert their HEU-based operating reactors?

In this 1-page proposal, I explore how a single, fixed price for HEU might solve the problem once and for all.

Phrases, natural language processing, and artificial intelligence

July 3, 2008

Michael Arrington’s long interview with Barney Pell and the Microsoft product manager for search, after the Powerset acquisition was announced, gives hints as to which of Powerset’s capabilities MSFT search needs most vis-a-vis what Google has already incorporated into its search. A lot of it seems to fall into the “phrase” statistics category rather than the “linguistic” understanding category — see below, but also see this discussion about Powerset and other players like Hakia in semantic search. Given the performance of today’s mainstream search engines from Google/MSFT/Yahoo, I’m beginning to think that until the day we have artificial intelligence, phrases (bigrams, trigrams, etc.) and related statistical techniques will remain the silver bullet for general-purpose web search.

The Powerset/MSFT folks have a BHAG (big-hairy-audacious-goal) and it’s hard to fault them for that. The technical challenge that Powerset is running up against is the same sort of wall (i.e. steep learning curve) that shows up in modern computer vision and image recognition applied to things like image-search engines or face recognition in photo collections. The point here is that the technology works in well-defined problem domains or with sufficient training/hints, and it is massively parallelizable on today’s low-cost server farms for specific tasks — but it either fails miserably or performs in a degraded, unreliable way when the software is not programmed to deal with increased scene or corpus complexity, i.e. the combinatorial explosion of inputs that arises with diverse features. This happens when the scene or corpus is noisy and when its features are ambiguous or new — exactly the situations where human beings still excel in making sense of the world through judgments, and why humans still write software. For example, see these papers from Stanford computer science class projects (1 and 2) that show how unreliable the face-recognition step can be for auto-tagging Facebook photo collections, and in turn how training can be used to achieve “100% Accuracy in Automatic Face Recognition.”

The first of Powerset’s capabilities that MSFT wants is word-sense disambiguation — both inside user queries and in the corpus text — to produce more precise search results. The example he gives is “framed.” Two papers listed below from the 1990s describe two different approaches, using statistical collocations (Yarowsky) and latent semantic indexing (Schuetze)… this is something MSFT may not be doing yet today, but Google is probably already doing using Bayesian word clusters and other techniques.

RN: I think everything that Barney said was right on. You see search engines, including Live Search and also Google and Yahoo, starting to do more work on matching not exactly what the user entered, but it is usually limited to very simple things. So now all of us do some expansion of abbreviations or acronyms. If you type “NYC” in a search engine these days, it understands that it means the same thing as New York. These are very, very simple rules-based things, and no one understands that bark has one meaning if it is about a tree and a different meaning if it is about a dog. Or an example that someone gave the other day was the question of “was so-and-so framed.” Framed could mean a framed picture, or it could mean set up for criminal activity that did not occur, and so on. And you have to actually understand something about it: if it is a person’s name then it applies to one sense of the word framed; if it is not, then it doesn’t. So one of the things that Powerset brings that is unique is the ability to apply their technology to the user’s query in ways that go beyond simple pluralization or adding an “-ing”, and Powerset also looks at the document — it looks at the words that are on a web page, and this is actually very important. If you look at just the user’s query, what you have available to figure out what they are talking about is three, four, five words, maybe even less. That can give you certain hints. If you look at a web page that has hundreds or thousands of words on it, you have a lot more information you can use, if you understand it linguistically, to tell what it’s about, what kind of queries it should match and what kind of queries it shouldn’t match. And Powerset is fairly unique in applying this technology in the index on a fairly large scale already, and with Microsoft’s investment and long-term commitment we can scale this out even further and apply it to even more of the web, not just the Wikipedia content they have thus far.
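To make the collocation idea concrete, here is a toy Python sketch of Yarowsky-style disambiguation for “framed”: pick the sense whose seed collocations co-occur most with the word. The seed lists are invented for illustration; this is not Powerset’s or Google’s actual method.

```python
# A toy, Yarowsky-flavored word-sense disambiguator for "framed": pick the sense
# whose seed collocations co-occur most with the word in the surrounding text.
# The seed lists are invented for illustration, not taken from any real system.
import re
from collections import Counter

SENSE_COLLOCATIONS = {
    "framed_picture": {"picture", "photo", "wall", "canvas", "poster"},
    "framed_accused": {"crime", "police", "innocent", "murder", "trial"},
}

def disambiguate(sentence):
    """Return the sense whose collocation set overlaps most with the sentence."""
    words = set(re.findall(r"[a-z']+", sentence.lower()))
    scores = Counter({sense: len(words & colls)
                      for sense, colls in SENSE_COLLOCATIONS.items()})
    sense, score = scores.most_common(1)[0]
    return sense if score > 0 else "unknown"

print(disambiguate("Was the suspect framed by the police, or guilty of the crime?"))
print(disambiguate("She framed the picture and hung it on the wall"))
```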

Second is the potential to support complex but realistic queries in the future — they suggest that this could replace vertical search engines, which seems a bit far out.

RN: I think what Peter Norvig is saying has some degree of accuracy, and he is also ignoring some things. Just for normal queries, queries that are not phrased as questions, there is a lot of linguistic structure. Someone might type in a query like “2 bedroom apartments, under 1000 dollars, within a mile of Potrero Hill.” That query is loaded with linguistic content. And that’s a realistic query — that is the type of thing that customers actually want to find on the web. Today there is a sort of helplessness, where customers know that certain queries are too complicated, and they won’t even issue them to a search engine. They will go to some deep vertical search engine where they can enter different data into different boxes.

The “2 bedroom apartments, under 1000 dollars, within a mile of Potrero Hill” example is interesting, since it requires logic and can’t be solved just using related keywords. Even with access to infinite compute cycles, I don’t expect that any general-purpose search engine applied to the web corpus can handle this query reliably today, including Powerset. For verticals like real estate or travel, we will still need things like Craigslist or Orbitz/Mobissimo that list classified ads and/or build on top of user submissions, metadata, and targeted crawling.
Compared to phrases, linguistic analysis helps provide a better answer to the question of whether a given web page matches the user’s intended query. It certainly has the potential to improve recall through more accurate query interpretation and retrieval of web pages from the search engine’s index. By itself, linguistic analysis does not improve the precision of search results (by re-ranking) in the way that hyperlink analysis did in Google’s PageRank and other web search engines.
The third use case is query expansion to related keywords using an automatically generated thesaurus — which I thought Google was already doing.

…that is not really an area that is that interesting. But some of these more complex queries really are. For example, shrub vs. tree. If I do a search for decorative shrubs for my yard, and the ideal web page has small decorative trees for my garden, it really should have matched that page and brought it up as a good result. But today Google won’t do it, Yahoo won’t do it, and Live won’t do it. So even in these normal queries there is a lot of value in the linguistics.

I didn’t understand this “decorative shrubs” example. When I try it on Google, I find a result that seems relevant to me. Furthermore, using the “~” option in front of any word, Google allows you to search using “related” keywords. If you try “decorative ~shrubs” you get results containing “decorative plants” as well.

Fourth, it seems 5% of queries are linguistic queries today (versus phrase queries), and they expect that fraction to expand as MSFT’s search engine capabilities mature to incorporate linguistic understanding of the query — mostly queries related to facts or solving problems.

BP: Let me answer the question of whether anybody actually searches this way. The answer is yes, people do this. It isn’t the most common mode, but we do see that probably 5% of queries are natural language queries. These are not all queries that are phrased in complete sentences, but they are queries where the customer has issued something that has some sort of linguistic structure. Almost any query with a preposition: X and Y, A near B, attribute A of Y, etc. Those things are loaded with linguistic structure…. RN: I have a list of some natural language queries in front of me. Can we just show you some queries that our customers have actually sent to us? These are random examples. The first person to see the dark side of the moon. How to get a credit card in Malaysia. Enabling system restore in group policy on a domain controller. Timeline of Nvidia. How to measure for draperies. What is the difference between Mrs. and women’s sizes? Does my baby have acid reflux? I could just go on and on. These fit in the category that we’ve labeled as matching about five percent of queries, and they’re really just cases where the customer can’t think of a simpler way to express it.

Finally, the people use case seems to be one of Powerset’s most visible strengths…

BP: We return answers. We actually synthesize, so if you were to say, “What did Tom Cruise star in,” you actually get not just the movies, but the cover art for the different movies. It synthesizes multiple pieces of information to give you a whole different kind of presentation. Or, if you were just to say, “Bill Gates” you’d be given an automatically generated profile of Bill Gates, pulled across many, many articles. It’s no longer just about 10 links, although we can certainly do more relevant job (and will) of the blue links, and a better job of presenting those links.

A friend of mine tried getting answers via Google, and got quick (< 5 minutes for all the searches) and relatively accurate answers without much hassle using phrases:

Those are examples that illustrate why phrases and statistical approaches are not only “good enough” for searching the web corpus, but also have a very high “signal-to-noise” ratio compared to all other machine search techniques that have become available to us in the last 20 years.

The root issue seems to me to be that human symbolic and image comprehension is still well beyond today’s programmable machine capabilities — which has nothing to do with compute power or the specific interface/modality (language or vision); it has much more to do with human intelligence itself, which resides in the cerebral cortex and other parts of the central nervous system. I’d be willing to bet that to achieve their BHAG and make the real quantum leaps from today’s search technology that they advertise, Powerset + MSFT will need the equivalent of artificial intelligence, not just natural language processing (or better image processing in the case of scenes).

Powerset is not the only semantic search engine. Besides Hakia, there is Cognition Technologies — a company founded by a UCLA professor — which also has its own Wikipedia search at http://WIKIPEDIA.cognition.com. This white paper compares their results to Powerset, query by query, showing how Cognition’s precision is higher and, most of the time, the number of unrelated hits is much lower (more relevant vs. unrelated hits) — of course this is a test designed by Cognition, and therefore a display of its own strengths. In one example they show Cognition returning exactly one result and Powerset returning 259. Also see Top-Down and Bottom-Up Semantics by Cognition’s CTO. Cognition’s other white paper compares them query by query to Google, Yahoo, Verity, and Autonomy — see section VI. In these tests, only Google shows an observable “statistical” boost, but it doesn’t quite work as well as Cognition — more precision and higher relevant recall. In summary, Cognition achieves these results using the semantic processing and natural language parsing technology described here,

Cognition’s Semantic NLP Understands:

  • Word stems – the roots of words;
  • Words/Phrases – with individual meanings of ambiguous words and phrases listed out;
  • The morphological properties of each word/phrase, e.g., what type of plural does it take, what type of past tense, how does it combine with affixes like “re” and “ation”;
  • How to disambiguate word senses – This allows Cognition’s technology to pick the correct word meaning of ambiguous words in context;
  • The synonym relations between word meanings;
  • The ontological relations between word meanings; one can think of this as a hierarchical grouping of meanings or a gigantic “family tree of English” with mothers, daughters, and cousins;
  • The syntactic and semantic properties of words. This is particularly useful with verbs, for example. Cognition encodes the types of objects different verb meanings can occur with.

Also see Q-Go, a European NLP technology company for customer service applications that partners with MSFT,

The intelligent, linguistic technology-based software that Q-go develops makes sure that customers receive a direct, immediate answer to all of their questions from a company’s website. Online, in their own language, and with at least the same comprehensiveness and quality as the answers provided by call centres. Not only is this easier for organisations, it’s also faster and cheaper. Q-go’s headquarters are located in Amsterdam, and the company has four local offices in Barcelona, Madrid, Frankfurt and Zurich.

A post by Don Dodge (MSFT) reveals that Powerset’s “semantic rules” can be applied to MSFT’s existing index of the web, and are therefore likely to be as much about word co-occurrence statistics and clusters as about using linguistic logic to gain understanding from the corpus. The example is also illustrative of how Powerset breaks down a natural language query, and the post goes on to explain how Powerset may also be useful in vertical search applications…

Powerset is using linguistics and (NLP) to better understand the meaning and context of search queries. But the real power of Powerset is applied to the search index, not the query. The index of billions of web pages is indexed in the traditional way. The big difference is in the post processing of the index. They analyze the indexed pages for “semantics”, context, meaning, similar words, and categories. They add all of this contextual meta data to the search index so that search queries can find better results.

Who is the best ballplayer of all time? Powerset breaks this query down very carefully using linguistic ontologies and all sorts of proprietary rules. For example, they know that “ballplayer” can mean Sports. Sports can be separated into categories that involve a “Ball”. Things like baseball, basketball, soccer, and football. Note that soccer does not include the word ball, yet Powerset knows this is a sport that includes a ball.

Powerset knows that “ballplayer” can mean an individual player of a sport that includes a ball. They know that “best of all time” means history, not time in the clock sense.
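As a toy illustration of that kind of ontology-driven query breakdown, here is a sketch with a made-up mini-ontology; Powerset’s actual ontologies and proprietary rules are certainly far richer.

```python
# A toy illustration of ontology-driven query breakdown as described in the quote
# above. The mini-ontology and the phrase-matching rule are invented for this sketch.
MINI_ONTOLOGY = {
    "ballplayer": {"category": "sports played with a ball",
                   "expansions": ["baseball", "basketball", "soccer", "football"]},
    "best of all time": {"category": "history (not clock time)",
                         "expansions": ["greatest", "all-time", "career record"]},
}

def break_down(query):
    """Map recognized phrases in the query onto categories and expansion terms."""
    query = query.lower()
    return {phrase: info for phrase, info in MINI_ONTOLOGY.items() if phrase in query}

print(break_down("Who is the best ballplayer of all time?"))
# Note how "soccer" appears as an expansion even though it lacks the word "ball",
# mirroring the point made in the quote above.
```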

Finally, there is an alternate theory that says Powerset was acquired for contextual ad placement, not search.

For the foreseeable future, phrases and statistical approaches will probably continue to deliver the greatest signal-to-noise ratio for machine indexing of web content, in the absence of a breakthrough in artificial intelligence. The evidence is not 100% conclusive, but during the last several weeks I’ve accumulated research papers to support this hypothesis…

Also, some papers on techniques and tools for extracting keyphrases, and performance evaluations, that I came across (a toy phrase-scoring sketch follows this list)…

  • Google made n-gram data available for all 5-grams (or smaller) appearing more than 40 times on the web, “All Our N-gram are Belong to You”
  • Open-Calais by Reuters can analyze for named entities, facts, and events and their API
  • The chapter “Natural Language Tools” in pages 149-171 in “Advanced Perl Programming” by Simon Cozens (O’Reilly) — you get a very “quick & dirty” introduction to a number of natural language processing concepts and ways to implement and play around with them. Although Perl has many natural language processing tools, the Cozens book cuts to the chase, explains which are the easiest tools to use, and shows you how to use them.
  • “Coherent Keyphrase Extraction via Web Mining” discusses four variants of the KEA key-phrase extraction algorithms – two baseline and two enhanced – and evaluates their performance on a set of computer science and physics research papers. The enhanced versions use statistical association based on web term/phrase frequencies to “weed out” key-phrases that are unlikely to be associated with the text, and they improve on the baseline by generating a higher percentage of key-phrases that match author-generated key-phrases.
  • “Narrative Text Classification for Automatic Key Phrase Extraction in Web Document Corpora” compares three key-phrase extraction algorithms (TF.IDF, KEA, and Keyterm) for summarizing web-pages on websites. They find that Keyterm performs best in tests with users. KEA is a keyphrase extraction package, and another one is Lingpipe.
  • Paper from MSFT Research on clustering algorithms to find user-verifiable “related” phrases within search results — could in theory be applied to any list of documents, “Learning to Cluster Web Search Results”
  • Autonomy’s Clustering for Enterprise.
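For a feel of what the simplest of these methods do, here is a toy phrase-scoring sketch that ranks a document’s bigrams by a TF.IDF-style score against a small reference corpus; it is not an implementation of KEA, Keyterm, or Google’s n-gram pipeline.

```python
# A toy phrase-scoring sketch in the spirit of the TF.IDF-style methods listed above;
# it is not an implementation of KEA, Keyterm, or Google's n-gram pipeline.
import math
import re
from collections import Counter

def bigrams(text):
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(" ".join(pair) for pair in zip(words, words[1:]))

def keyphrases(doc, reference_docs, top_k=10):
    """Rank a document's bigrams by term frequency times inverse document frequency."""
    tf = bigrams(doc)
    df = Counter()
    for other in reference_docs:
        df.update(set(bigrams(other)))
    n_docs = len(reference_docs) or 1
    scored = {p: count * (math.log((1 + n_docs) / (1 + df[p])) + 1)
              for p, count in tf.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_k]

# Usage: keyphrases(open("paper.txt").read(), [open(f).read() for f in other_papers])
```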

Human-Like Memory Capabilities

June 18, 2008

Human-Like Memory Capabilities by Scott Fahlman, June 17, 2008

My interpretation of what he is saying is that he is looking to build an artificial memory system that can

  1. build-up new complex concepts/facts from incoming knowledge/information
  2. cross-check any given input against known facts
  3. “route” to the relevant fact(s) in response to any new situation (I’ve always wondered if there is a connection to routing on a graph).

All of this happens automatically and rapidly in real time by taking advantage of massive parallelism built up from millisecond circuits, just as the human brain does, without needing the GHz circuits of today’s microprocessors.
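As a purely illustrative toy (my own interpretation, not Fahlman’s design), the “routing to relevant facts” idea can be pictured as marker passing or spreading activation over a fact graph:

```python
# A toy spreading-activation ("marker passing") sketch over a small fact graph.
# This is purely my own illustration of the routing idea above, not Fahlman's design.
from collections import defaultdict, deque

def build_graph(facts):
    """facts: iterable of (concept_a, relation, concept_b) triples."""
    graph = defaultdict(set)
    for a, _, b in facts:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def activate(graph, seeds, max_hops=2):
    """Breadth-first activation from the seed concepts; returns hop counts."""
    activation, frontier = {s: 0 for s in seeds}, deque(seeds)
    while frontier:
        node = frontier.popleft()
        if activation[node] >= max_hops:
            continue
        for neighbor in graph[node]:
            if neighbor not in activation:
                activation[neighbor] = activation[node] + 1
                frontier.append(neighbor)
    return activation

facts = [("Clyde", "is-a", "elephant"), ("elephant", "is-a", "mammal"),
         ("elephant", "color", "gray"), ("mammal", "has", "hair")]
print(activate(build_graph(facts), seeds=["Clyde"]))
# {'Clyde': 0, 'elephant': 1, 'mammal': 2, 'gray': 2}: nearby facts light up first
```

In hardware, each hop here would be a cheap parallel step rather than a sequential loop, which is the point about millisecond circuits versus GHz clocks.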

A friend of mine asked me: but isn’t this exactly what Google is?

Maybe Google includes a subset of this list. It indexes incoming knowledge (facts) and makes it searchable in response to a human-defined query. Still, I see some differences, which I outline below… See a related blog post, “So What’s the Google End-Game?”, about Google and artificial intelligence, which quotes the Atlantic Monthly article “Is Google Making Us Stupid?”

First is the ability to specify the query in real time, in real-life situations. Google or machines can’t do that yet; only humans can at this point. Second is low search efficiency relative to human memory. Although Google may be the most comprehensive and best search engine in the world today, it still requires a lot of human interpretation to use it and to refine queries through multiple searches based on the initial results returned — as an example, I’m picturing all the effort needed to search for scientific papers and content. Since we end up having to do many, many searches, the “search” efficiency is not very high compared to human thought, which appears to be near-instantaneous across our store of facts — and that too using millisecond circuitry compared to GHz microprocessors.

Google search may be a machine, but at the heart of it all are associations and judgments originally created by humans, in at least two ways. First, PageRank uses the number and prominence of hyperlinks that point to a page as its metric (collaborative filtering) — the more the better. See “On the Origins of Google”:

… the act of linking one page to another required conscious effort, which in turn was evidence of human judgment about the link’s destination.
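For reference, here is a minimal power-iteration sketch of the PageRank idea described above (a toy, not Google’s production algorithm):

```python
# A minimal power-iteration sketch of the PageRank idea: rank flows along hyperlinks,
# so pages with many prominent inbound links end up with higher scores.
# A toy, not Google's production algorithm.
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outs in links.items():
            for target in outs:
                new_rank[target] += damping * rank[page] / len(outs)
        rank = new_rank
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(links))  # "c" scores highest: it has the most inbound links
```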

The second area is Bayesian association of “related” keywords (e.g., “nuclear” is related to “radioactive”) based on mining human-generated content. See “Using large data sets”. These associations are input by humans on the web and merely computed/indexed by Google. To some degree it’s possible that, like Google, people communicate with each other to learn and form their own relevance judgments. I don’t think that explains 100% of how human memory works.

There must be something else, based on a human’s personal experience of the world — like the way babies learn by putting everything in their mouths — that bootstraps human memory and turns it into what it ends up becoming. Is it logic, association, or something else? I think that is what’s missing in today’s machine memories — Google included.

This sums it up… See page 149, “Advanced Perl Programming,” by Simon Cozens

“Sean Burke, author of Perl and LWP and a professional linguist, once described artificial intelligence as the study of programming situations where you either don’t know what you want or don’t know how to get it.”

Millennium Technology Prize

June 11, 2008

Millennium Technology Prize Awarded to Professor Robert Langer for Intelligent Drug Delivery

The YouTube video on this site interviewing Dr. Langer is cool… he shows how drug delivery via polymers is now leading to precise targeting of drugs down to the single-cell level, and enabling drug release controlled by microprocessors embedded in the human body.

In choosing his career in 1974, he blew off the oil companies, and he talks about how his first boss liked to hire unusual people. For every one or two approaches that worked, he invented 200 ways that didn’t.

Fermi’s Nobel lecture (1938)

June 3, 2008

The simplicity of Fermi’s Nobel lecture (1938) is stunning — the implications of this work changed history forever. Other Nobel lectures I’ve read go on and on — this lecture is only 8 pages. Fermi also cites and gives credit to dozens of other researchers upon whose work his discoveries are based. He explains the discovery of radioactivity caused by neutron bombardment and the study of interactions of “thermal” neutrons with all the elements, including uranium and thorium.

p. 415,

The small dimensions, the perfect steadiness and the utmost simplicity are, however, sometimes very useful features of the radon + beryllium sources.


His experiments involved neutron sources, paraffin wax, and spinning wheels, not complicated particle accelerators or machinery. Anyone with freshman-level chemistry or physics knowledge should be able to understand the lecture, and even that is not absolutely needed.

Book Review of “The White Man’s Burden”

April 3, 2008

On a long plane ride to India, I read “The White Man’s Burden: Why the West’s Efforts to Aid the Rest Have Done So Much Ill and So Little Good” by William Easterly – the guy Bill Gates had a strong reaction to during a panel discussion in Davos earlier this year. Thanks to Mora for suggesting and lending the book to me.

From “Bill Gates Issues Call For Kinder Capitalism,” January 24, 2008, Wall Street Journal…

“If we can spend the early decades of the 21st century finding approaches that meet the needs of the poor in ways that generate profits for business, we will have found a sustainable way to reduce poverty in the world,” Mr. Gates plans to say….

To a degree, Mr. Gates’s speech is an answer to critics of rich-country efforts to help the poor. One perennial critic is Mr. Easterly, the New York University professor, whose 2006 book, “The White Man’s Burden,” found little evidence of benefit from the $2.3 trillion given in foreign aid over the past five decades.

Mr. Gates said he hated the book. His feelings surfaced in January 2007 during a Davos panel discussion with Mr. Easterly, Liberian President Ellen Johnson Sirleaf and then-World Bank chief Paul Wolfowitz. To a packed room of Davos attendees, Mr. Easterly noted that all the aid given to Africa over the years has failed to stimulate economic growth on the continent. Mr. Gates, his voice rising, snapped back that there are measures of success other than economic growth — such as rising literacy rates or lives saved through smallpox vaccines. “I don’t promise that when a kid lives it will cause a GNP increase,” he quipped. “I think life has value.”

Brushing off Mr. Gates’s comments, Mr. Easterly responds, “The vested interests in aid are so powerful they resist change and they ignore criticism. It is so good to try to help the poor but there is this feeling that [philanthropists] should be immune from criticism.”

Also see Easterly’s rebuttal.

Easterly is a former research economist at the World Bank, now at NYU. In his book he looks at the successes and failures of international aid interventions (financial and military) by “The West” and makes the case that they have done more harm than good over the past 50+ years.


Most of Easterly’s book makes sense to me, and I agree with Easterly that philanthropic and aid agencies are not “above” criticism – their hyped-up expectations do not necessarily make things better, and sometimes they make things worse by standing in the way of more realistic, lasting solutions… but,

  1. I agree with Gates on one thing: you can get into trouble measuring national economic development using aggregate GDP growth instead of measuring the purchasing power of the bottom of the pyramid (the poorest half or quarter of earners) in the economy. Easterly cites India as a success story of development, showing a chart of exponential GDP growth over 20-40 years and using the Indian IT industry as an example. Despite this progress, the failure is that 50 years after Indian independence, close to half of all Indians (400-500 million people) still live on less than $1 or $2 per day.
  2. Easterly says new (niche) market creation is limited by social and legal barriers to trust and property rights and therefore must take place indigenously – there is not much we can do about it living in “The West.” I think we have not yet explored the potential of the Internet to overcome these constraints and help diversify agricultural economies (long tail). See my essay “Opening Niche Markets in Rural India using the Internet”
  3. Style. Though his analysis is compelling and data-driven, with graphs, stories of people, case studies of developing nations, and world history, the title seems a bit polarizing or stuck in the past, and the tone of the writing is funny but also a bit sarcastic. Perhaps it is discouraging to visionaries and optimists who want to break from the past. In his book he takes aim at Bono and Jeffrey Sachs’ “The End of Poverty.”

I tried to capture the main ideas of the book… I’m sure I missed something, but I think it’s mostly here.


— Top-down “planners” at large institutions like the World Bank mobilize resources on the basis of utopian agendas and large-scale “big pushes” that attract donor governments and private institutions (in the US, UK, and the rest of “The West”). These visions are never achieved because they lack feedback from the people (in Africa, Asia, and “The Rest”) whom they are intended to benefit, i.e. the poor. On the other hand, he notes that the World Bank produces very high-quality economic research.


— Unlike market-driven firms or (legitimately) elected officials, the planners are accountable to donors, not the poor. Planners’ jobs do not depend on serving the poor but rather on indulging donors’ unrealistic expectations, which may never materialize. The planners’ efforts do more harm than good (a large part of what the book is about). The failures of these big pushes become self-fulfilling as donors redouble their efforts, bureaucracy becomes bloated, and they begin to measure progress by the volume of aid disbursed, not the impact on the poor. The incentives of planners and poor people are not sufficiently aligned – this is the principal-agent problem.


— Bottom-up “searchers” (NGOs, entrepreneurs, profit-seeking companies) who are on “the ground” in developing countries can get direct feedback from the poor people they serve and make a real impact on their lives. They set realistic, achievable goals, unlike the planners. Too little money is going to support the searchers. He presents several case studies.


— On microfinance and microcredit,

Microcredit is not the panacea for poverty reduction that some have made it out to be after Yunus’ discovery. Some disillusionment with microcredit has already come in response to these blown-up expectations. Microcredit didn’t solve everything; it just solved one particular problem under one particular set of circumstances – the poor’s lack of access to credit except at usurious rates from moneylenders.

— Markets are a spontaneous outgrowth of social trust (for transactions) and property rights (for investment), and can’t be planned by aid agencies, foreign governments, or “out of the blue” after an invasion or removal of a dictator.


— Foreign aid has been most effective and has made a large-scale impact on people’s lives in areas like vaccination, health care delivery, and programs to keep kids, especially girls, in school, compared to areas where results can’t be directly measured. Dollar for dollar, the recent momentum to offer AIDS treatment (about $1,000 per person), like Bush’s $15 billion commitment of US taxpayer funds for Africa (30 million infected), is many times less cost-effective than preventing the spread of AIDS through condoms (600 million not infected), or even than preventing other life-threatening conditions like malaria, diarrhea, and infant mortality. Much of the spread of AIDS could have been avoided had prevention been a bigger priority, since experts have been predicting this epidemic for decades. With AIDS, saving a life gets more “emotional” attention from the public than preventing transmission, which could save many more lives.


— There have been some success stories, but economic growth in developing countries has not been correlated with aid or intervention by the West. Colonialism and imperialism have resulted in long-term economic stagnation, which he offers as a cautionary case study when considering other neo-imperialistic plans to take over weak states. His claim is that countries develop much faster and better when they are left on their own.

— National financial health has little direct impact on the earnings of the poorest people; it matters mainly indirectly, via inflation and government subsidies to the poor.

The IMF’s approach is simple. A poor country runs out of money when its central bank runs out of dollars. The central bank needs an adequate supply of dollars for two reasons. First, so that residents of the poor country who want to buy foreign goods can change their domestic money (let’s call it pesos) into dollars. Second, so those poor-country residents, firms, or governments who owe money to foreigners can change their pesos into dollars with which to make debt repayments to their foreign creditors. What makes the central bank run out of dollars? The central bank not only holds the nation’s official supply of dollars (foreign exchange reserves), it also makes loans to the government [aside from foreign borrowing with bonds] and supplies the domestic currency for the nation’s economy. The government spends the currency [it borrows], and the pesos pass into the hands of people throughout the economy. But are people willing to hold the currency? The printing of more currency [excessive government borrowing from the Central Bank] drives down the value of currency if people spend it on the existing amount of goods – too much currency chasing too few goods… so they take the pesos back and exchange them for dollars. The effect of printing more currency that people don’t want is to run down the central bank’s dollar holdings. Too few dollars for the outstanding stock of pesos is kind of like the Titanic with too few lifeboats. The country then calls on the IMF. So the standard IMF prescription is to force contraction of central bank credit to the government, which requires a reduction in the government’s budget deficit [government spending]… which forces the government to do unpopular things [like cut subsidies] – a disturbance of domestic politics.


— Bad governments (corruption and violent dictators) have been responsible for much of the slow growth in these countries, which are in turn caused by either a colonial past or by historical poverty itself. Foreign aid tends to prop these governments up, and in some cases private organizations working around these governments can lead to much better results.

— Loans are not necessary to balance a national budget, and the IMF’s prescriptions for foreign-exchange lending to developing countries and for reducing government spending can be way off. This is due to severe accounting irregularities in these countries’ books, uncertainty about how or when markets will react to falling currency prices, and uncertainty about how people in the economy will react to new information (their behavior). Lending on such shaky foundations can lead to a self-reinforcing “debt trap” through repeated refinancing of poor countries and the propping up of bad governments. The IMF does better in emerging markets, but he says it may be better off leaving the poorest countries alone.

Corporate risks + rewards of breakthrough R&D

March 10, 2008

Corning’s Biggest Bet Yet? Diesel-Filter Technologies
By SARA SILVER, March 7, 2008; Page B1, Wall Street Journal

Corning, which went public in 1945 and has a market capitalization of about $36 billion, has survived — and often thrived — in recent decades by following a playbook that Wall Street and corporate America deems outmoded. While companies like Xerox Corp. scaled back long-term research, Corning stuck with the old formula, preferring to develop novel technologies rather than buy them from start-ups.

An investment 25 years ago has turned Corning into the world’s largest maker of liquid-crystal-display glass used in flat-panel TVs and computers. But another wager, which made it the biggest producer of optical fiber during the 1990s, almost sank the company when the tech boom turned into a bust.

Corning Inc. has survived for 157 years by betting big on new technologies, from ruby-colored railroad signals to fiber-optic cable to flat-panel TVs. And now the glass and ceramics manufacturer is making its biggest research bet ever.

Under pressure to find its next hit, the company has spent half a billion dollars — its biggest wager yet — that tougher regulations in the U.S., Europe and Japan will boost demand for its emissions filters for diesel cars and trucks.

“This is the biggest cash hole we’ve ever been in,” says Corning President Peter Volanakis.

In Erwin, a few miles from the company’s headquarters in Corning, the glassmaker is spending $300 million to expand its research labs. There, some 1,700 scientists work on hundreds of speculative projects, from next-generation lasers to optical sensors that could speed the discovery of drugs.

Corning’s roots go back to 1851, when Amory Houghton, a 38-year-old merchant, bought a stake in a small glass company, Cate & Phillips. For most of Corning’s history, a Houghton was either chairman or chief executive. Even today, Corning, population 12,000, is very much a company town. The original Houghton family mansion, still used for company meetings, overlooks the quaint downtown, which is punctuated by a white tower from one of Corning’s original glass factories. Most senior managers have spent their entire careers at Corning.

“Culturally, they’re not afraid to invest and lose money for many years,” says UBS analyst Nikos Theodosopoulos. “That style is not American any more.”

Corning also goes against the grain in manufacturing. While it has joined the pack in moving most of its production overseas, it eschews outsourcing and continues to own and operate the 50 factories that churn out thousands of its different products.

Corning argues that retaining control of research and manufacturing is both a competitive advantage and a form of risk management. Its strategy is to keep an array of products in the pipeline and, once a market develops, to build factories to quickly produce in volumes that keep rivals from gaining traction.

But because Corning often depends heavily on a single product line for most of its profit — 92% of last year’s $2.2 billion profit came from its flat-panel-display business — it is vulnerable to downturns. Even small movements in consumer demand for or pricing of its LCD-based products can cause gyrations in its stock price. During the dot-com meltdown when the market for fiber-optic cable crashed, Corning was brought to the brink of bankruptcy and by 2003 was forced to lay off half of its workers. Today it has 25,000 employees.

Making cars safely, efficiently, and with high quality — a look into organizational culture

January 29, 2008

One week ago (1/22/2008), along with a group of MIT alums, I was fortunate to visit the automotive manufacturing plant in Fremont, CA that turns out the Toyota Corolla, Pontiac Vibe, and Toyota Tacoma pickup truck. The NUMMI auto plant is a joint venture of GM and Toyota. It is Toyota’s most efficient plant in North America, taking 19 hours of human effort per vehicle produced, and seventh most efficient overall (GM and Honda have more efficient plants) — see the article “Most efficient assembly plants” in Automotive News. In 2007 NUMMI produced 407,881 vehicles — see “There’s a new No. 1 plant: Georgetown.” I have blogged about NUMMI’s high-trust workplace before, and here are some notes that another MIT alum put together in 2003.

The first thing I noticed at the entrance to the plant was a rug on the ground titled “Safety Absolutes”

Safety is the overriding priority

All accidents can and must be prevented

At NUMMI, safety is a shared responsibility

Their mission statement posted on the wall is

“Through teamwork, safely build the highest quality vehicles at the lowest possible cost to benefit our customers, team members, community, and shareholders.”

What struck me was the message of social accountability and interdependence conveyed by both the rug and the mission statement.

The plant is 5.5 million square feet (118 football fields or 122 Costcos) and is the workplace for 5,000 “team members” (they didn’t say employees). There are also 300 temporary workers who come in to help with seasonal variations in production.

The emphasis on the relationship with team members and the “community” comes out at every turn in the plant. The plant has had no layoffs in its 23-year history of operation. Wages start at $20/hr and rise to $35/hr within three years. The plant has 160 “team rooms” with refrigerators, lunchrooms, and lockers. Phrases I heard included “quality, pride, teamwork, job security, benefits, pay, family, successful year, looking out for my family, winning team, all about the family.”

There are five stages (divisions) to auto manufacturing at NUMMI.

  1. Stamping steel into body sheets — 1 million lbs of steel / day
  2. Body / welding
  3. Paint
  4. Plastics
  5. Assembly — the assembly line is 1.5 miles long, producing 650 trucks / day and 900 cars / day. Three hours per truck, with one coming off the line about every 85 seconds (a quick arithmetic check follows below).
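
As a quick sanity check of those line-rate numbers (my own back-of-the-envelope arithmetic, not something quoted on the tour), one truck every 85 seconds over the two 7.5-hour shifts mentioned later in this post comes out close to the reported 650 trucks per day:

    # Back-of-the-envelope check of the truck-line rate (my own arithmetic)
    shift_hours = 2 * 7.5              # two 7.5-hour shifts per day
    takt_seconds = 85                  # roughly one truck off the line every 85 seconds
    trucks_per_day = shift_hours * 3600 / takt_seconds
    print(round(trucks_per_day))       # ~635, consistent with the ~650 trucks/day figure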

Quality control involves random test drives and audits at the end of the production line. The quality philosophy really starts with the team members, who are trusted with the authority to push a button that raises an alert to stop the assembly line if they find a problem — an innovation from Japanese lean manufacturing. There was a time when auto plants did not allow their employees to do this. Once they raise the alert, a red light goes on and they have 81 seconds to clear the alert before the line actually stops. Sometimes the problem gets cleared up within that time; other times the line has to stop while it is fixed. Having read about this before coming to the plant, I was really curious, and I actually saw the line stop a few times. The line statistics are prominently displayed for team members on a scoreboard. They were reporting 2% downtime, and the target is to stay below 4%.

The NUMMI team members work in teams of 4-6 people, and they rotate jobs throughout the day whenever they want to — this eliminates most of the repetitiveness and boredom usually associated with manufacturing. They usually spend a year in a division (like plastics) before moving on to other types of jobs in the plant, so employees learn about all aspects of manufacturing and production. Team members are encouraged to find ways to improve the process and implement those ideas. Using a “frame rotator,” the truck chassis are flipped upside down to improve team-member ergonomics during assembly of the drivetrain. We saw several robots by Kawasaki, and automated guided vehicles transporting auto parts.

If an auto plant can do this, what would it look like if we incorporated this philosophy into software development? Software programmers and test engineers would have the authority to raise alerts and hold up software releases, instead of a manager having the final say in the triage of bug reviews. People would rotate between development and test. What if a space shuttle launch could be delayed by any engineer on the team instead of being determined solely by launch managers? To make this work, all engineers would have to have sufficient system-level knowledge and be “trusted” with the authority to make these decisions. A rough sketch of what such a release gate might look like follows.
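
Here is a minimal, purely hypothetical sketch in Python (the class and names are my own, not any real release tool or NUMMI process) of an “andon”-style release gate: any engineer can place a hold, and the release only goes out once every hold has been cleared.

    import time

    class AndonGate:
        """A toy 'stop the line' gate: any engineer can place a hold on a release,
        and the release only ships once every hold has been cleared."""

        def __init__(self):
            self.holds = {}  # engineer -> (reason, time the hold was raised)

        def raise_hold(self, engineer, reason):
            self.holds[engineer] = (reason, time.time())
            print(f"Hold raised by {engineer}: {reason}")

        def clear_hold(self, engineer):
            self.holds.pop(engineer, None)

        def release_allowed(self):
            # No manager override here: the line stays stopped until the holds clear.
            return len(self.holds) == 0

    # Any team member can stop the "line", not just a manager.
    gate = AndonGate()
    gate.raise_hold("test engineer", "regression found in the installer")
    print(gate.release_allowed())   # False: the release is held
    gate.clear_hold("test engineer")
    print(gate.release_allowed())   # True: the line moves again

The point of the sketch is the authority model, not the code: the gate has no notion of rank, so the decision to stop rests with whoever sees the problem.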

There is a hierarchy,

  1. skilled worker — several of whom are organized into quality circles
  2. team leader of 4-5 workers (tends to be nurturing)
  3. group leader of several teams (tends to be more disciplined)

Only the teams are evaluated for performance, not individuals. People can get fired, but they can’t get laid off. The plant operates in two 7.5-hour shifts for a total of 15 hours per day, five days a week, so people have weekends off. According to the tour guide, job rotation was the big thing that drew employees to the plant, not the work itself or guaranteed employment.

The big difference that comes out here is the relationship with team members, and how that trust turns into better results.