September 29, 2010

Open-source software in the Internet of Things: why we need a repository-less package management system

Software has become one of the most critical forms of User-Generated Content (UGC). The amount of software created or updated every day is overwhelming: the SourceForge community aggregates more than 2 million software producers contributing to 240,000 software projects. The increasing popularity of application stores (e.g. more than 180,000 applications in the Apple App Store) confirms several trends in the software industry:
  • crowdsourced software has become a key economic argument. Apple typically takes advantage of the number of third-party applications that are available exclusively on its devices. The capacity to offer, in a short time, the largest and most diverse set of software and services is a competitive challenge. In this context, most large actors of the communication industry, including phone manufacturers and network operators, offer incentives to developers (from monetary compensation to open access to data and APIs), which tends to reinforce the proliferation of new software.
  • pervasive environments need crowdsourced software. The explosion in the number of devices, as well as commercial pressure (especially time-to-market), induces a gigantic demand for software development. This demand far exceeds the capacity of classic software producers. For example, the strength and dynamism of the Linux community is a key factor in the rising popularity of Linux-based operating systems for small devices.
In comparison to classic UGC aggregation, the management of user-generated software is a challenging task. Indeed, modern software often consists of a huge number of small packages. These packages have inter-dependency relationships that may easily be broken during the deployment life-cycle, as illustrated by the small sketch after the list below. Finding an efficient and reliable way to maintain, distribute and install these software packages over billions of machines is therefore a real issue. In the current approach, software distributors rely on a set of repositories, i.e. centralized servers collecting all the packages that have been certified. We see two major drawbacks in this architecture:
  • the certification of packages. The software distributor plays the role of a certification authority. Users must deposit their packages if they want them to be integrated into the repositories. The distributor verifies the integrity of the submitted packages and makes the valid ones available for other users to download. As addressed in the EDOS project, various approaches and tools exist to facilitate the management of large package repositories. However, the centralized structure requires expensive infrastructure and extra human management, and the process of certifying third-party packages is slow and complex: developers typically complain about the increasing delay before their software becomes available in the App Store. Clearly, centralized certification of packages does not scale. It is also a severe threat to the privacy of users.
  • the delivery of packages. Microsoft researchers have emphasized that a set of repositories cannot ensure fast, planet-scale delivery of packages, yet massive delivery of software patches is a key security requirement. If the number of devices grows as is commonly admitted in the Internet of Things vision, a centralized repository-based architecture will soon reach its limits. Moreover, devices in pervasive environments are not necessarily always connected to the Internet: we also need to rely on intermediate devices and opportunistic ad-hoc communications if we want to upgrade all devices, including the tiniest ones.
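To make the dependency problem concrete, here is a minimal sketch; the package names and the toy resolver are purely illustrative, not an actual package manager. It shows how an installer must order packages and detect broken dependencies before deployment:

```python
# Minimal sketch of dependency resolution: hypothetical packages, not a real
# package manager. Each package lists the packages it depends on.
packages = {
    "libssl":  [],
    "libhttp": ["libssl"],
    "sensor-app": ["libhttp", "libjson"],   # "libjson" is absent: broken dependency
}

def install_order(packages):
    """Return an installation order in which every dependency is installed
    before the packages that need it, or report what is broken."""
    order, visiting, done = [], set(), set()

    def visit(name):
        if name not in packages:
            raise ValueError(f"missing dependency: {name}")
        if name in done:
            return
        if name in visiting:
            raise ValueError(f"circular dependency involving: {name}")
        visiting.add(name)
        for dep in packages[name]:
            visit(dep)
        visiting.remove(name)
        done.add(name)
        order.append(name)

    for name in packages:
        visit(name)
    return order

try:
    print(install_order(packages))
except ValueError as err:
    print("cannot deploy:", err)   # here: missing dependency: libjson
```

When a dependency is missing or circular, the deployment breaks; a repository-less system has to be able to detect and resolve such situations without asking a central server.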
We need to revisit the package management system with a clean-slate approach: a fully distributed (repository-less) system, which presupposes rethinking the usual inter-dependency relationships between packages. We propose an internship that is expected to be a small first step in that direction.
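As a rough illustration of what repository-less delivery could look like (a speculative sketch only, not the design of the proposed system), devices could exchange package updates epidemically whenever they meet, so that a patch eventually reaches even devices that never contact a central server:

```python
import random

# Toy epidemic dissemination of a package update among devices that only meet
# opportunistically. Device identities and the meeting model are illustrative.
NUM_DEVICES, MEETINGS_PER_ROUND = 50, 25
has_update = {device: False for device in range(NUM_DEVICES)}
has_update[0] = True          # one device somehow obtained the new package

rounds = 0
while not all(has_update.values()):
    rounds += 1
    for _ in range(MEETINGS_PER_ROUND):
        a, b = random.sample(range(NUM_DEVICES), 2)   # an ad-hoc encounter
        if has_update[a] or has_update[b]:
            # In a real system, the receiver would first verify a signature on
            # the package instead of relying on a central certification authority.
            has_update[a] = has_update[b] = True

print(f"all {NUM_DEVICES} devices patched after {rounds} rounds of encounters")
```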

September 15, 2010

One academic world, two divergent ways to live it

The academic world is like the media industry. Some actors understand the opportunities offered by the digital world; the others are still unable to reinvent themselves to fit our century.

On the one hand, you have the unexpected success of a Q&A website devoted to Theoretical Computer Science. Anybody can post a question, anybody can suggest answers, and anybody can vote on the relevance of these answers. A reputation score is computed from the votes you receive, and this score progressively turns you into a kind of administrator. Such websites are often associated with chatting teenagers. In this case, more than one thousand serious academics have subscribed (a third of the whole community?), and these serious people now discuss problems related to theoretical computer science every day. The bootstrap was not easy, but the success is here. For example, quantum computing was a hot topic today, with two threads. Active participants include PhD students, unknown people, distinguished professors...

Wait, these guys who are expected to review the crappy papers I submitted to prestigious journals are wasting their time chatting with friends instead of doing their job? Well, it seems that the emerging conversation between scientists is worth spending significant time on the website.

On the other hand, you have the editor of a journal in the networking community. I reviewed a bad-but-not-so-bad paper two months ago. The editor sent me a kind email yesterday to inform me that, based on the different reviews (two or three, I suppose), the paper had been rejected. I kindly requested the other reviews. I just wanted to know what the other scientists who read the same paper thought about it. Were they as harsh as me? Were they annoyed by the same weaknesses? Did I miss important flaws? Did I misunderstand some points? I received a kind reply: "Sorry, we don't do that." The reviews exist, but the reviewers cannot access them because the editor has decided so. In parallel, I am on the Program Committee (PC) of a workshop. The reviewing platform does not allow me to look at the papers that have not been assigned to me. I complained to the PC chair, but his reply was: "The main task of TPC members is to give their technical opinion about the papers assigned to them. It would not be of any use if you could access the other papers, since those papers will have their own TPC members." What if I just want to do something that is of no use to you, but of interest to me? What if I want to review other papers just for fun? What if a paper that has not been assigned to me deals with a topic I find interesting? What if I want to contribute to the discussion about an exciting paper?...

I am not surprised that it is more and more difficult to find motivated reviewers able to write their reviews on time. What is the incentive to write a review if it is not part of a conversation? The collaborative work around the P vs NP story has demonstrated that collaborative reviewing is far better than a mere sum of blind reports.

We could obviously go further. Here is a list of small changes, ranked from the easiest to the most difficult to accept:
  • all TPC members access all papers,
  • all TPC members access all reviews,
  • all TPC members write reviews for any paper,
  • all authors access all papers,
  • all authors access all reviews,
  • all authors write reviews for any paper.
Would it be a perfect way to prepare a workshop where participants actually discuss?

September 6, 2010

Research in decentralized peer-to-peer: death and need

Gnutella and Kazaa appeared at the end of the last century. The promises of these systems fostered intense research activity in the area of peer-to-peer networks. The two most cited papers in Computer Science between 2000 and 2010 are both related to peer-to-peer systems. At that time, the motivations that researchers were allowed to admit were scalability and dependability. The design of free systems (i.e. without any central authority) has never been a convincing argument, either for reviewers or for funding agencies. For example, two classic papers in the peer-to-peer literature -- BitTorrent and Freenet -- were published in minor, crappy conferences.

So far, data centers have proven to be scalable and dependable. In this context, interest in peer-to-peer systems is declining. The main conferences dealing with peer-to-peer have immediately claimed to be open to submissions about systems that are not totally distributed: it is the time of peer-assisted architectures and overlays of devices controlled by a central authority (e.g. set-top boxes). See for example this paragraph in the Call for Papers of the ninth workshop on Peer-to-Peer Systems (IPTPS 2010):

"This year, the workshop's charter will be expanded to include topics relating to self-organizing and self-managing distributed systems. This is in response to recent trends where self-organizing techniques proposed in early peer-to-peer systems have found their way into more managed settings such as datacenters, enterprises, and ISPs to help deal with growing scale, complexity, and heterogeneity. In the context of this year's workshop, peer-to-peer systems are defined to be large-scale distributed systems that are mostly decentralized, are self-organizing, and might or might not include resources from multiple administrative domains."

Another consequence is that the only area where peer-to-peer experts can reasonably argue that pure peer-to-peer systems make sense -- live streaming systems -- has received dramatic attention fueled by tons of grants: more than seven thousand papers containing the words peer-to-peer, live and streaming have been published since 2009. From an algorithmic perspective, the similarities between pure peer-to-peer systems and Content-Centric Networking are turning the latter into a hot topic among peer-to-peer experts. In my opinion, the gap between this sudden peak of scientific output and the actual need for research in these areas is huge.

But what about research on fully decentralized peer-to-peer architectures for free systems? The troubles around WikiLeaks, the recurrent funding issues faced by free services like Wikipedia or arXiv, and the terrible privacy problems of current social platforms should invite every reviewer (not only at conferences but also in funding agencies) to consider the "free systems" motivation as critical.


September 1, 2010

Content-Centric Networking and the Revolution of Content Delivery

Many scientists in the networking field are excited by what was initially called Content-Centric Networking. Recently, two national funding agencies have announced large projects in this area: Named Data Networking by the US NSF, and Réseaux Orientés au Contenu by the French ANR. Here is my take on this topic.

It seems that new generations of Internet routers will have the capacity to cache content. Their future deployment is an opportunity to revisit the techniques currently used in the Internet to deliver content. So far, the flaws of the Internet and the drawbacks of IP-layer multicast have been overcome by Content Delivery Networks (CDNs) such as the Akamai network. In brief, a CDN comprises on the order of a hundred thousand servers, located as close as possible to end users' networks. These servers store and deliver the content of their clients (here, service providers) to end users. In a way, the predominance of CDNs is part of the network neutrality debate, because small service providers cannot get the same quality of service as Akamai-powered incumbents.

The seminal work done at the Palo Alto Research Center (PARC) addressed the fundamental issue of routing queries and data based on content names. This work enables the exploitation of the caching feature of new Caching Routers. However, the management of thousands of in-network Caching Routers remains an open question, which has to take into account:
  • the distributed nature of this caching system. Contrary to the centralized management of CDNs, the envisioned network of Caching Routers is distributed by nature: every Caching Router is expected to decide by itself whether a content object that it routes should be cached or not (see the sketch after this list). Moreover, a claimed objective is to retain the simplicity and scalability of current Internet protocols. After all, the Internet works because it is simple; let's stick to this approach.
  • the complexity of the peering relationships between autonomous networks in the Internet. The Internet is a loosely coordinated aggregation of networks, and its equilibrium depends on the selfish actions of every network. The deployment of Caching Routers is among the few events with the potential to significantly alter inter-network relationships and affect the global Internet.
  • the evolution of content. Cisco claims that video traffic will represent 90% of the overall Internet traffic in 2014. While video clips à la YouTube can be treated as classic cacheable content objects, many other forms of video services are emerging. In particular, time-shifted streaming is becoming a major trend, for TV of course, but also for potential life-streaming systems (lifecasting). As we have recently shown, these new forms of video consumption represent a challenge for network management.
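To illustrate the first point above, here is a minimal sketch of a local, autonomous caching decision. The class, names and policy are hypothetical simplifications for the sake of the argument, not the PARC design:

```python
from collections import OrderedDict

class CachingRouter:
    """Toy Caching Router: a small LRU content store consulted before
    forwarding a request upstream. Names and policy are illustrative only."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self.store = OrderedDict()          # content name -> content object

    def request(self, name, fetch_upstream):
        if name in self.store:              # cache hit: serve locally
            self.store.move_to_end(name)
            return self.store[name]
        data = fetch_upstream(name)         # cache miss: forward the request
        self.decide_to_cache(name, data)    # local, autonomous caching decision
        return data

    def decide_to_cache(self, name, data):
        # Simplest possible policy: always cache, evict the least recently used.
        self.store[name] = data
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)

# Usage: "upstream" stands for the rest of the network.
router = CachingRouter(capacity=2)
upstream = lambda name: f"<data for {name}>"
for name in ["/videos/clip1", "/videos/clip2", "/videos/clip1"]:
    print(name, "->", router.request(name, upstream))
```

The decision relies only on local state, which is consistent with the objective of keeping the protocols simple; the open question is precisely how thousands of such independent decisions interact across autonomous networks.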
These challenges are genuinely exciting. Yet, as usual in the networking community, scientists work on closed projects and prepare papers that are submitted to prestigious conferences like Infocom or NSDI, but too rarely released in an open library like arXiv.