Tag Archives: Adobe

[repost] Open source graphic design


How to work as a Graphic Designer without sleeping with Adobe

I am just now completing a certificate in graphic design at the online school Sessions.edu, and although I have come to like the Adobe Creative Suite tools (mainly InDesign), I would rather rely only on open source tools. Why is that? Well, there are many factors that steer my decision in that direction.

If you want to read more about open source and creativity, I invite you to read FLOSS+Art, which is a solid exploration of the relationship between open source and creativity/art. From the description:

“FLOSS+Art critically reflects on the growing relationship between Free Software ideology, open content and digital art. It provides a view onto the social, political and economic myths and realities linked to this phenomenon.”

This book has been a strong inspiration and a cornerstone for me to embrace open source in all my creative projects.

Using open source instead of licensed software is quite empowering and freeing. Even though I paid for Adobe Creative Suite, I still don’t feel that I own the software, which influences the way I create and express myself. Even worse, when I used to download pirated software, my creativity would be hindered by a feeling of guilt. All that led me toward using only open source tools, with all of their advantages:

+ Free software – meaning you don’t pay for it (though you can still donate what you want!)
+ The code is open; if you want your tools to do something different, you are welcome to change it.
+ A solid community of users and developers to contact when in need
+ You can upgrade your software as many times as you want (without paying more!)
+ Peace of mind regarding copyright infringement or license/legal trouble with the tools you are using

There is a quality and peace of mind to using open source software that is hard to explain. It is often captured in the saying “If you can’t open it, you don’t own it,” which is not to be taken literally but which conveys an understanding of the technology we use. This freedom is priceless; hence, for me, creativity and open source are a perfect match.

If you are to become a professional graphic designer, you still need to be familiar with the industry-default tools, the Adobe suite; but once you’re working on your own projects, personal ones or for clients, you can use the tools that fit your needs. Of course, it’s not the tools that make the designer, only your skill and creativity.

I’ll list here all the tools that I’ve been working with in the last few months, for school projects and professional contracts. I’ll restrict the list to graphic design for the moment and might expand it into web design in a future post.


Photoshop replacement with GIMP

One thing I like about these open source programs is that they don’t try to do everything. Nowadays, for example, you can actually do pretty much all your design and illustration inside Photoshop. But why is that? I personally prefer tools that do one thing, and do it well. I guess that was the idea behind the tools of the Adobe Creative Suite, but the goal got lost in translation and now every tool tries to do everything.

GIMP is a great example of a tool that does what it does, and does it well! It’s a really solid image retouching and photo editing program. It’s a really mature open source project with a huge community of users and developers. It’s intelligently built and can be extended with Scheme or Python scripts!

Illustrator replacement with Inkscape and MyPaint

I’ve been using Inkscape more and more lately, and I have to say that I am in love again! With its focus on keyboard shortcuts and simple tool navigation, Inkscape becomes second nature quite easily. It’s powerful, fast, and quite flexible. It has most of the basic functions of Illustrator, yet isn’t overloaded with bells and whistles.

One thing that Illustrator tries to do but fails at is being a painting program. With its brushes and pressure-sensitive drawing it almost works, but I really can’t paint with Illustrator. On the open source side of the world, that gap is well covered by MyPaint.

MyPaint is simple, elegant, and to the point. It’s optimized for tablets and uses a minimum of controls to change colors (with palettes), change brushes, and move your canvas around. It’s the only tool that gives me the real feeling of drawing, thanks to its different paint brushes. You can easily create your own palettes and brushes, or hack any of the brushes that come pre-installed.

InDesign replacement with Scribus

InDesign is a hard one to replace, maybe because I really enjoy working with it, or maybe because I haven’t played enough in the open source world to completely let go. The main mature page layout tool available now is Scribus; you will find an interface similar to InDesign’s, and it doesn’t take long to pick up the basic commands. It’s a solid page layout program, whether you’re designing a poster or a whole book. Since I haven’t worked on books with Scribus yet, I can’t tell how good or bad it is at that, but having used it a few times I can see that it’s viable software.

I am also really interested in the development of CSS for designing books and print. Coming from a web design background, it seems to make a lot more sense to code your styles in CSS. You can read a really interesting experiment on A List Apart.
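To give a flavor of what styling print with CSS looks like, the CSS Paged Media rules let you declare page size, margins, and running page numbers. This is a minimal sketch (the sizes are arbitrary, and support varies between browsers and print formatters):

```css
/* Set up the printed page: A5 paper with generous margins */
@page {
  size: A5;
  margin: 20mm 15mm;
  /* Running page number in the bottom-center margin box */
  @bottom-center {
    content: counter(page);
  }
}

/* Start each chapter on a fresh page */
h1 {
  page-break-before: always;
}
```

Dedicated print formatters tend to support far more of the paged-media spec than browsers do, which is what makes book experiments like the one mentioned above possible.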

Other useful tools

Gpick, a quite powerful color picker that will also help you create color schemes and even try out palettes. You can download Gpick here.

Another color management tool is Agave.

For font management in Linux: font-manager and fontforge.

Image batch processor Phatch.

Screen Ruler: I enjoy being able to measure what is on my screen, and I can do that easily with Screen Ruler. It’s a few-step installation, but it’s worth the time invested.

Sometimes you need to write, and only write, without Facebook, email, and Skype disturbing you every second. For that I love to use PyRoom, a really simple distraction-free writing tool.

One thing about open source is that it’s maturing all the time; the community is growing, and it’s a countercurrent to the huge software companies that are mainly preoccupied with profit. The difference between using open source software and paid software has quite an effect on my creativity, and I invite you to try it out and observe how it feels!

[repost] Adobe: Understanding HTML5 Semantics – Part 1


If your business is anything like mine, you have certainly heard plenty of news about HTML5 lately. Everyone is talking about the “Flash-killing” video element, animation with canvas, geolocation, and so on. In fact, the conversation around HTML5 has expanded to include many topics that are not HTML5 at all. Across the web, people discuss the expressive new features of CSS3 while calling them HTML5. And amid the overwhelming marketing noise around HTML5, the newly introduced elements and other semantic changes are often forgotten. In this article, I hope to help you understand and learn to use these new semantic elements properly; they may not be the most exciting part of HTML5, but they are absolutely important. As dry as the topic may seem, using these new elements correctly adds rich new meaning to your markup.




Prerequisite knowledge: familiarity with building web pages using HTML and CSS.



Why are we getting new elements at all? Do we need them? Where did they come from?

Think about your own code. Have you ever used <div id="header">, <div id="nav">, or <div id="footer">? That is where these new elements come from. Millions (perhaps billions) of web pages were crawled, and the most common class and id names were tallied up. As you can imagine, after seeing the same div names for the umpteenth time, an analyst knows they are onto something significant.

In fact, the classes and IDs listed above correspond to three of the best-known new elements—header, nav, and footer. Those make sense to most of us. Where it gets harder is with the additions of article and section—and what on earth is an aside? More importantly, where should they go on a page? I won’t cover all of them in this article, but I will discuss several of the new elements introduced in HTML5 (keep in mind that a few elements are still popping in and out of the list):

Figure 1. First you need to decide which elements you want to use

Most pages have a header at the top, a footer at the bottom, and perhaps a navigation bar below (or inside) the header or in a sidebar. But thinking in those terms is very much an HTML4/XHTML habit. In other words, with the new elements, the HTML5 working group lets us value our page content by its qualities rather than its position. You should not think about where content sits on the page, but about how that content relates to the rest of the page. And beyond that: what is the specific role of a given piece of content within the page, a section, and so on?

These elements can nest inside one another. In fact, a header may find itself inside a nav, or a footer inside an article. But before your brain turns into a markup contortionist, let’s look at the elements themselves. Four of the new elements are called sectioning elements. They are like new Lego bricks that snap in alongside the divs you use today (and you will definitely keep using divs). They are article, aside, nav, and section.


These sectioning elements create an outline for your page. In HTML4/XHTML, an outline was created implicitly by the heading levels in our markup. Divs have no effect on a page’s outline. The result looks a bit like the outline of a thesis or research paper. Wikipedia displays an outline for every article—and yes, it is based on heading levels as well (see Figure 2). Each article starts at the H1 level and then moves into top-level H2 headings with nested H3s, H4s, and so on. The result is an implicit outline, which they place on the page for navigation. Using headings sensibly is also a great help to users of assistive technology, who can request a list of your headings and jump to logical places on the page.

Figure 2. Wikipedia displays an outline for each article to aid navigation

With HTML5 you can create the outline explicitly. Instead of headings, you can use sectioning elements to create the sections of our documents. These elements create a proper table of contents for the document regardless of which heading level each section starts with. A sectioning element can start with an H1, and whatever heading level a section starts with, the hierarchy descends from there.

Starting every sectioning element with an H1 is a perfectly acceptable approach. It allows for a logical structure in which blocks of code can be pulled from a CMS and used anywhere on the site while keeping their semantic structure intact. For now, however—until assistive technology compensates for this gap—it is best to use the heading level appropriate to each section’s nesting depth. In other words: create your outline with sectioning elements, but for now continue to start those sections with H1, H2, and H3 according to their level. If a piece of content should not start a new part of the outline, use a div for it.



<section>

The first and most generic of the four new sectioning elements is section. A section represents a generic section of a document or application. According to the HTML5 specification*:

“A section, in this context, is a thematic grouping of content, typically with a heading.

Examples of sections would be chapters, the various tabbed pages in a tabbed dialog box, or the numbered sections of a thesis. A Web site’s home page could be split into sections for an introduction, news items, and contact information.”

Remember that when an element is needed only for styling purposes or as a convenience for scripting, you should still use a div. The section element is not that generic. It defines a part of your page that should create a new node in the page’s outline.

As mentioned above, a site’s home page—drawing users into the different areas of the site—is a common place to find multiple section elements. An informational page may contain multiple sections as well. The markup could look like this:

<section>
  <h1>British Virgin Islands</h1>
  <h2>A bareboat charter wonderland!</h2>
  <p>Want to go sailing on your vacation?
  Among these Caribbean jewels,
  there are options for both beginners
  and experienced charterers…</p>
  <section>
    <h1>Virgin Gorda</h1>
    <p>The Baths at Virgin Gorda are truly
    one of the most picturesque places in the Caribbean.</p>
  </section>
  <section>
    <h1>Norman Island</h1>
    <p>Whether it's snorkeling at the Indians
    or drinks and night life at Willy T's floating restaurant,
    the Bight on Norman Island gives you
    a full range of choices…</p>
  </section>
  <section>
    <h1>Jost Van Dyke</h1>
    <p>This small island contains
    several great evening activities
    including the Soggy Dollar Bar and Sidney's Restaurant…</p>
  </section>
</section>


<article>

There has been a lot of debate online about the use of the article element. At first, the specification* seemed vague and confusing to some. After further clarification, the element is now defined as:

“The article element represents a self-contained composition in a document, page, application, or site, which is, in principle, independently distributable or reusable, e.g. in syndication. This could be a forum post, a magazine or newspaper article, a blog entry, a user-submitted comment, an interactive widget or gadget, or any other independent item of content.”

Part of the confusion stems from the use of the term article itself, since it usually means the kind of writing you see in a magazine, newspaper, or blog. But don’t get hung up on the “in syndication” bit. It does not mean the article element only applies to blog posts or pieces that are actually syndicated. It means the content could stand on its own if needed, carrying all the information necessary to know what it is and where it came from.

One dictionary definition of article is “an individual object, member, or portion of a class; an item or particular: an article of food; articles of clothing.” So the first step toward clarity is shifting your thinking from the publishing-world sense of article to the simpler idea of a “complete object” or item.

Now, of course, let’s use a blog post as an example. I don’t mean to muddy the point I just made—other considerations apply here—but yes, the article element fits a blog post. Remember that the specification above also calls a “user-submitted comment” an article. Whether it is legitimate to mark up a comment this way has sparked plenty of debate. But a comment marked up as an article must be nested inside the article it belongs to. It is not placed after the closing tag of the article it comments on. Semantically, then, it is treated as an item related to the original item it sits inside. Yet a comment is usually self-contained: it carries identifying information about who posted it—a name, maybe even an avatar—a time/date stamp, and the comment itself. It stands alone and can be identified by who wrote it and when.

<article>
  <h1>Anchoring isn't for beginners</h1>
  <p><time pubdate datetime="2009-10-09T14:28-08:00"></time></p>
  <p>If you've ever chartered a 45ft catamaran, you know that mooring balls are your friends. They protect the coral and sea bottom from the constant abuse of frequently anchoring boats.</p>
  <article>
    <p>Posted by: Peg Leg Patooty</p>
    <p><time pubdate datetime="2009-10-10T19:10-08:00"></time></p>
    <p>Right! Mooring balls are for wusses! Pirates only use anchors! Arrrrr!</p>
  </article>
  <article>
    <p>Posted by: Ariel</p>
    <p><time pubdate datetime="2009-10-10T19:15-08:00"></time></p>
    <p>Thank you for thinking of what's under the sea. Even Ursula would be thrilled.</p>
  </article>
</article>

Notice that both the blog post (the parent article) and the comments (child articles) are marked up with the new time element, which the specification defines as a machine-readable date and/or time.
Note the term machine-readable. It is becoming increasingly useful to mark up data so that machines can parse it, letting you reuse the data automatically in your own programs or build mashups for a variety of applications. This is one reason semantics matter: when machines understand what your data means—and that includes search engines—the data becomes richer and more actionable.

You may also have noticed the Boolean attribute pubdate on the time element. According to the specification*, it “indicates that the date and time given by the element is the publication date and time of the nearest ancestor article element, or, if the element has no ancestor article element, of the document as a whole.”

In other words, the attribute tells machines parsing your data that this time element is the actual publication date of the comment or article. If you use pubdate on a time element in your page’s global footer, it indicates the publication date and time of the web page itself.
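To illustrate that last point, a page’s global footer could carry the page’s own publication date like this (a minimal sketch; the date is made up):

```html
<footer>
  <p>Published on
    <time pubdate datetime="2010-06-01">June 1, 2010</time>
  </p>
</footer>
```

Because the time element has no ancestor article here, pubdate applies to the document as a whole.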

On to the last two sectioning elements…


<nav>

The third sectioning element concerns navigation around your site and its pages*:

“The nav element represents a section of a page that links to other pages or to parts within the page: a section with navigation links. . . . The element is primarily intended for sections that consist of major navigation blocks.”

In modern web markup it is very common to have a group of links to the main areas of a site (perhaps with drop-down menus), a group of links within a section of the site, and—when content is split across multiple pages—a group of links that helps you navigate via some paging technique. You might also have a group of links to other sites you recommend (a blogroll), a group of resource links on a particular topic, and a repeat of the top-level links in your page’s footer (so users don’t have to scroll back to the top to choose their next destination).

The first sets to consider are the “major navigation” groups: the link groups users need to navigate the whole site or a section of it. These are the groups that users of assistive technology will want to skip past (to get to the relevant content first) or jump straight to when they want to go elsewhere.

Think carefully about the second kind of group before marking it up as “major navigation.” It is not that a group of links leaving your site can never be marked as nav—it depends on the purpose of those links. If it is a set of links for registering for your company’s events, and they all point to Eventbrite, I would consider that navigation relevant to your site. But a pile of links you are suggesting simply as things your readers might also like (such as a blogroll) probably is not. A copy of your site navigation placed in the footer does not need to be marked up as a nav element—but it is not wrong to do so either.

Be aware that a nav element can contain all kinds of navigation—not just unordered lists. Although an unordered list or p element is certainly the most common content, you could also write your navigation as a poem or a piece of prose. As long as the element’s purpose is for users to navigate, that is perfectly acceptable.

Here is a quote from Ian Hickson* (of the WHATWG) himself: “Don’t use <nav> unless you think a <section> with <h1>Navigation</h1> would also be appropriate.”
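Putting that advice into practice, a site’s major navigation might be marked up like this (a minimal sketch; the page names are invented):

```html
<nav>
  <ul>
    <li><a href="/">Home</a></li>
    <li><a href="/charters">Bareboat charters</a></li>
    <li><a href="/islands">The islands</a></li>
    <li><a href="/contact">Contact</a></li>
  </ul>
</nav>
```

A blogroll in a sidebar, by contrast, could simply stay a plain unordered list.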


<aside>

Finally, we come to the last sectioning element—aside. The name obviously derives from the part of a page people call a sidebar, aside, or sidepanel. The “side” in those words really refers to a visual position on the page. But aside is not meant to capture that. Here is what the specification* says:

“The aside element represents a section of a page that consists of content that is tangentially related to the content around the aside element, and which could be considered separate from that content. Such sections are often represented as sidebars in printed typography.

The element can be used for typographical effects like pull quotes or sidebars, for advertising, for groups of nav elements, and for other content that is considered separate from the main content of the page.”

What does tangentially related mean? It means related, but only slightly. When deciding whether content should be marked up as an aside, start by asking yourself: could it be considered separate from the surrounding content? Could you remove it without affecting the meaning of the document or section?

While you can certainly use it to hold groups of navigation and advertising links—whether or not they sit at the side of the page—you can also use it directly inside the section or article it relates to. For example, a magazine-style pull quote could be marked up like this:

<p>If you've ever chartered a 45ft catamaran, you know that mooring balls are your friends. They protect the coral and sea bottom from the constant abuse of frequently anchoring boats. They're also one of the easiest ways to secure yourself while at sea avoiding the dreaded "anchor watch" that can keep you awake half the night.</p>
<aside>
  <q>Mooring balls protect the coral and sea bottom from constant abuse…</q>
</aside>
<p>Learning mooring techniques isn't overly difficult. Communication is key between the person grabbing the ball with the hook and the sailor at the helm.</p>

The pull quote can be removed without harming the article’s content, because it merely repeats a quote from the article with special typographic treatment. You should not use the aside element for a parenthetical remark that needs to stay in the article. If removing it would affect the content of the article (or section), it is probably not an aside.

Remember, if you get stuck: these new elements were created to give us more descriptive ways to mark up our content than a plain div. Keep your thinking logical and focus on purpose. When marking up your site with all these new sectioning elements, don’t stress over where things go on the page—relax—look at what they are and how they relate to each other. Then let it go…


There are a few more elements closely related to those I have just discussed, though they do not create new sections in the outline. The two most important are header and footer.


<header>

Although the header element will often find itself at the top of a page, it is really an element that marks introductory content within the page, not a position. (Don’t confuse <header> with your document’s <head>, or with the heading elements—h1, h2, h3, and so on.) The specification describes it as follows:


“… The header element is usually expected to contain the section’s heading (an h1–h6 element or an hgroup element), but this is not required. The header element can also be used to wrap a section’s table of contents, a search form, or any relevant logos.”

More than one header element is allowed on a page. Typically your page will need one global header, but you may also want a header inside sectioning elements. Keep in mind that although a sectioning element will usually contain a heading element, that may simply be one of h1–h6, quite possibly wrapped in an hgroup element where appropriate.
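As a sketch of that idea, a global header might combine an hgroup and a search form (the form itself is an invented example):

```html
<header>
  <hgroup>
    <h1>The British Virgin Islands</h1>
    <h2>Jewels in the Caribbean</h2>
  </hgroup>
  <form action="/search">
    <input type="search" name="q">
  </form>
</header>
```

A header inside a sectioning element works the same way, just scoped to that section’s content.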


<hgroup>

A new element called hgroup has moved in and out of the specification several times. Where it will finally end up remains to be seen, but the hgroup element was created to act as a wrapper (or shield) when two headings appear back to back, so that only the first is exposed to the document’s explicit outline. Consider a site name and tagline:

<hgroup>
  <h1>The British Virgin Islands</h1>
  <h2>Jewels in the Caribbean</h2>
</hgroup>

In the document’s outline, the h1 element will appear, but the h2 will be shielded from the outline.


<footer>

As with header, you can have more than one footer on a page. You can have a global footer, and sectioning elements can have footers as well. The specification* says:

“The footer element represents a footer for its nearest ancestor sectioning content or sectioning root element. A footer typically contains information about its section, such as who wrote it, links to related documents, copyright data, and the like.

When the footer element contains entire sections, they represent appendixes, indexes, long colophons, verbose license agreements, and other such content.”

When you mark up a blog post, you may want the author and the date at both the top and the bottom of the post. If you want to place that information in a footer, that is perfectly fine—use a footer element. You would not put the information in a header element at the top and a footer at the bottom: it is the same information, so if it is duplicated it should be contained in the same element.


<address>

Although address is not a new element, I want to mention it because it will likely see more use alongside the new semantic elements. Don’t be confused: the address element is not for your home, office, or business mailing address. It is actually defined as follows*:

“The address element represents the contact information for its nearest article or body element ancestor. If that is the body element, then the contact information applies to the document as a whole.”

That means it is usually the author’s email address or a link to a web page. It typically lives in a footer. If it is in the global footer, it applies to the whole page. If it is in an article’s footer, it applies only to that article. Use a postal address only when it is actually the contact information for the article in question.
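Tying footer and address together, an article’s footer might look like this (a minimal sketch; the name and email address are made up):

```html
<article>
  <h1>Anchoring isn't for beginners</h1>
  <p>…</p>
  <footer>
    Posted by
    <address>
      <a href="mailto:skipper@example.com">Captain Example</a>
    </address>
    on <time datetime="2009-10-09">October 9, 2009</time>
  </footer>
</article>
```

Because the address element sits inside the article, the contact information applies to that article only.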

<figure> and <figcaption>

The last element we will look at is the figure element:

“… The element can optionally have a caption, and is self-contained and typically referenced as a single unit from the main flow of the document.

The element can thus be used to annotate illustrations, diagrams, photos, code listings, etc., that are referred to from the main content of the document, but that could, without affecting the flow of the document, be moved away from that primary content, e.g. to the side of the page, to dedicated pages, or to an appendix.”

A mistake people make is believing that all of their images belong in figure elements. Instead, think of a figure element as you would a figure in a book. It may or may not be a photo. And it may or may not have a caption—it depends on its place in the flow of the content. Using it for a code sample or an illustration in a web tutorial is a perfectly legitimate use.

If a figure has a caption (a figcaption element), the caption is contained within the parent figure element (see Figure 3).

<figure>
  <img src="virgin-gorda.jpg"
  alt="The boat as seen through the rocks at the Baths on Virgin Gorda.">
  <figcaption>The Baths at Virgin Gorda</figcaption>
</figure>

Figure 3. The caption is contained within the parent figure element


Modern browsers have no problem styling most of the new HTML5 elements. Older browsers, however, need a bit of hand-holding. If you have HTML5 support installed in Dreamweaver (or you have CS5.5), you can open the new HTML5 layouts it includes.

  1. Go to File > New > Blank Page > HTML > HTML5: 3 column fixed, header and footer.
  2. Make sure an HTML5 doctype is selected, then click Create (see Figure 4).

Figure 4. Dreamweaver includes two new HTML5 starter templates

Near the end of the CSS (whether it is attached or placed in the head of the document), you will see this comment and selector:

/* HTML5 support - Sets new HTML5 tags to display:block so browsers know how to render the tags properly. */
header, section, footer, aside, nav, article, figure {
  display: block;
}
Setting the display property to block lets older browsers render the elements properly. For older versions of Internet Explorer (before IE9), though, we need training wheels. IE will not recognize these elements unless they are injected into the DOM with JavaScript. It will render anything it recognizes correctly, and nothing it does not—leading to a confusing mess. Notice the following comment near the end of the page’s <head> element:

<!--[if lt IE 9]> <script src=
</script> <![endif]-->

The link sits inside an IE conditional comment (a comment that only IE reads) that applies to any version earlier than IE9. It points to a small JavaScript file that gets IE up to speed.

Let’s look at how the page is put together, and at the new elements themselves. If you use the HTML5 templates in production, make sure you place the elements appropriately to match the semantic structure of your site. If your asides do not map to the sidebars, or there are other large structural differences, you are better off creating a new, blank HTML5 page in Dreamweaver and using its robust code hinting and code completion to build your own custom page structure.


Remember, if your head is spinning with all these new elements and semantics—keep it simple. Content should be marked up based only on what it is, not where it sits. Think about the purpose of each piece of content and which element best expresses it. And when a div is the most appropriate element, using one is perfectly fine.

In Understanding HTML5 Semantics – Part 2*, we will look at the differences in document structure between HTML4 (or XHTML—the two terms are used interchangeably in this article) and HTML5, including the newly added global attributes.

To get started with HTML5 and CSS3, see David Powers’s three-part tutorial series, HTML5 and CSS3 in Dreamweaver CS5.5*.

For more about HTML5, CSS3, and the new geolocation, storage, and other related APIs, see the resources in the HTML5 Developer Center*.

Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License+Adobe Commercial Rights


This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License. Permissions beyond the scope of this license, pertaining to the code samples included in this work, are available from Adobe.


Stephanie (Sullivan) Rewis is the founder of W3Conversion, a web design and training company devoted to web standards and accessibility. A front-end developer, Stephanie created the CSS Starter Layouts introduced in Dreamweaver CS3 and updated in DW CS5. Passionate about teaching and sharing, she has written books and numerous tutorials for various websites and print publications, including a regular column for Web Designer Magazine, regularly trains corporate web departments, and speaks at many conferences each year. Stephanie is the WaSP liaison to Adobe Systems, working with their product managers to ensure that the web products created continue to move today’s web standards forward. An admitted workaholic, she rarely leaves her office and frequently talks to the little people in her computer via Twitter (@stefsull). Her hobby, if she had time for one? Brain research. Her guilty pleasure? Eighties music.



[repost] Why we’re using HBase: Part 2 – Adobe


Blogging about the Hadoop software stack

Why we’re using HBase: Part 2

The first part of this article is about our success with the technologies we have chosen. Here are some more arguments (by no means exhaustive :P) about why we think HBase is the best fit for our team. We are trying to explain our train of thought, so other people can at least ask the questions that we did, even if they don’t reach the same conclusion.

How we work

We usually develop against trunk code (for both Hadoop and HBase) using a mirror of the Apache Git repositories. We don’t confine ourselves to released versions only, because we implement fixes, and there are always new features we need or want to evaluate. We test a large variety of conditions and find a variety of problems – from HBase or HDFS corruption to data loss etc. Usually we report them, fix them and move on. Our latest headache from working with unreleased versions was HDFS-909, which causes the corruption of the NameNode “edits” file by losing a byte. We were comfortable enough with the system to manually fix the “edits” binary file in a hex editor so we could bring the cluster back online quickly, and then track the actual cause by analyzing the code. It wasn’t a critical situation per se, but this kind of “training” and deep familiarity with the code gives us a certain level of trust regarding our abilities to handle real situations.

It’s great to see that it gets harder and harder to find critical bugs these days; however, we still brutalize our clusters and take all precautions when it comes to data integrity.

Testing distributed systems is hard. There aren’t many tools or resources. There are some promises for performance and scalability benchmarking tools, thanks to Yahoo! (we too await the open-sourcing of the YCSB tool), but right now you have to roll your own, and it takes time. There’s no clear test plan for distributed systems, no failover benchmarking tool, nothing on reliability, availability or data consistency.

Your system can fail no matter how well you thought you tested it, even if it’s sunny outside and you’re throwing a party (especially then). Google, Twitter, Amazon – all have had downtime. Everyone fails once in a while. It’s only a matter of time, and users tend to tolerate a short downtime or performance degradation. On the other hand, what users will not tolerate is losing their data. We are completely paranoid about losing data. If other failure scenarios resulting in degraded performance or even a little downtime are bearable, losing data is not.

We try to learn our systems by heart and be able to fix anything fast, even while the system is running. After more than a year, we’re OK with keeping data for our clients, but we’re still testing and taking precautions.

Really thorough testing of ALL the solutions that we use has paid off. It’s a lot of work, because you have to build the scaffolding for it, but going back and forth keeping up with the changes and pushing fixes is a sure way to know the system in depth.

Demystifying HBase Data integrity, Availability and Performance

It’s good to know the strengths of a system, but it’s more important to be aware of and understand its limitations. We have extensive suites of tests covering both. Testing performance is pretty straightforward, but testing data integrity is the hardest, and this is where we spend most of our time.

Data Integrity

Integrity implies the data has to reach “safety” before confirming the request to the client, regardless of any hardware failures that might happen.

HBase confirms a write after its write-ahead log reaches 3 in-memory HDFS replicas. Statistically, there is a rather small probability of losing 3 machines at the same time (unless all of your racks are on the same electrical circuit, power transfer switch, PDU or UPS). HDFS is rack aware, so it will place your replicas on different racks. If you place and power your machines correctly, this is safe in most cases. If this won’t be enough for our clients, in certain critical applications, we will come up with stronger guarantees (e.g. making sure that data is flushed to disk on all 3 replicas – not here today).

There are many questions that arise even if you do flush to disk. In a full power loss scenario, even if you flush to disk you need to consider OS cache, file system journaling, RAID cache and then disk cache. So it’s debatable whether your data is safe after a flush to disk. We use battery-backed write cache RAID cards that disable the disk caches. However, we’d rather make sure our racks are powered correctly than rely on disk flush.

Most of our development efforts go towards data integrity. We have a draconian set of failover scenarios. We try to guarantee every byte regardless of the failure and we’re committed to fixing any HBase or HDFS bug that would imply data corruption or data loss before letting any of our production clients write a single byte.


We feel the same about availability. When a machine dies, data served by that box will be unavailable for a short window. This is a current limitation, and while we know how to make the system 100% available should you lose a box or two, it’s a matter of prioritizing our efforts that we have chosen not to put effort into it yet. The reality is we can afford a short downtime for data partitions – as long as we don’t lose any data. Also, we don’t expect machines to fail too often.

So, for us, juggling with Consistency, Availability and Partition tolerance is not as important as making sure that data is 100% safe.

Random Reads

HBase performance is good enough for us. That is, it’s more than we need right now. Would you strive to reach 5ms read performance instead of 10ms, or 1 second max unavailability instead of a few minutes when a server crashes, if you can’t guarantee data safety? We wouldn’t. Just as you’d accept a credit card failure once, you wouldn’t accept your accounts being wiped out at any time. So we choose to spend our resources on ensuring data integrity.

Getting close to the 7ms average disk response time, for small records, is possible with the current architecture. As always, the devil is in the implementation details. The architecture promises linear scalability, but it’s the implementation that makes it reliable. Moreover, we all know that data isn’t accessed uniformly at random – that is the worst-case scenario. We get ~1ms reads for data in memory today, and the read performance and throughput can be improved 10-fold by adding caching (see this article).

Our performance results are notably better than the ones in the YCSB test paper. That’s for another post though.

Availability and Random Read Performance are possible “limitations” that we are OK with (for now); we are extremely happy with random write and sequential read performance against billions of rows, however.


While you can cache for reads, scaling writes is harder. Write performance and sequential read performance enable two of the most important use cases: heavy write volumes and efficient distributed processing.

HBase has great random-write performance. We are using HBase 0.21, which DOES sync the write-ahead log after every put call in the RegionServer, so the data is in the write buffer of at least 3 nodes. In an RDBMS, for example, you can replicate the data for improved read performance, but you can’t scale writes and total data size unless you partition it. And when you partition the data you lose the original properties such as transactions and consistency, and your operational costs can skyrocket.

As systems mature, great write performance will not be solely an HBase advantage; we expect other storages to reach this performance, just as we expect HBase to reach the optimal random-read performance.

But what use is in being able to keep such large amounts of data without being able to process them efficiently?

Sequential Reads (Scans)

Again, our tests using MapReduce show great performance. HBase is built on the Bigtable architecture, which was designed to work with MapReduce – this also makes it a great fit for OLAP. Data location is deterministic and sequential rows are stored sequentially on disk, so HBase can read every 256 MB (configurable) of your table in a single request, because data is not fragmented. It can do this in parallel too. So given enough processing power, you can have each disk reading at full throttle.

Full Consistency

HBase is an inherently consistent system. After you write something, modifications are immediately available. You can’t get stale data, or have to reason about quorum reads. We think consistency is good for a multitude of reasons: if you write an application over a consistent system, the application logic is much simpler. You don’t have to take stale data into account; it’s just like single-threaded programming: you’re going to read what you’ve written earlier. Consistency is also a solid base on which to build more complex primitives: transactions and indexes, increment operations, test-and-set semantics, etc.

It all comes down to engineering choices: it’s a good exercise for the reader to determine if a system which defaults to eventual consistency, that can accept writes at any moment on any node (e.g. using consistent hashing) and has data fragmented across the cluster, can perform optimally when it comes to sequential reads. How much network chat is needed to do a table scan?

It’s all about what we think is a sensible default: availability and partition tolerance deal with relatively isolated scenarios: you can compute the probability of losing a node or getting your network split in two. It’s relatively low. However, consistency is something you deal with in every operation you do.

This is all getting a little philosophical, but here’s a list of questions (not rhetorical ones), related to this:

An eventual consistent system could be configured to support full consistency and/or data ordering. Would this impact or degrade other attributes like availability and performance? A system that can juggle with C, A and P, is quite flexible. But what part of CAP do you want to support by default, and what’s the impact when you change it afterwards?

Our assumption is that building on consistency is an appealing and sound decision, and any architecture that doesn’t handle this in its default design will lose the performance and availability when forcing it later on. Partition tolerance is not something that we think is worth handling within a single datacenter (redundant datacenter equipment investments are pretty much the norm for both electricity and networking). We do however care about partitioning when doing multi-datacenter replication.

Which of C, A or P do you think will hold the greatest impact? (Hint: for us it’s C :D)

HBase’s edge is in the H

We created a fair amount of tests that we maltreat our system with. It takes effort to implement correct fault tolerance, and there’s an advantage in relying on Hadoop for it. Also, in the last 4 years, a large client base has validated Hadoop by using it in their production systems (especially Yahoo!). This has had a real impact on the stability and fault tolerance of the system.

Now consider a different system, built from scratch. It has to enable and test all that Hadoop does starting from 0 (again, architecture is a promise, but implementation is what you USE). In a best case scenario, this system will gather a critical mass, a community will be created, and it will evolve organically, etc. Hadoop and HBase have that today, and it’s a big advantage.

We pride ourselves in keeping up with new technologies, but we think that Hadoop and HBase are over the “safety” threshold, for what we need to do.

HBase has an “edge” in Hadoop over other technologies, in that, just like Hadoop, it fills the gap between storage scalability, fast processing and cost-efficiency. Why was Hadoop successful? We think it’s because it didn’t rely on a narrow vertical need. Hadoop did not build something that was impossible before. We had NAS systems, and OLAP cubes for data processing, but Hadoop made this possible for any development group, with little initial investment, hence democratizing scalable data processing.

About support

Many Hadoop developers are paid by companies which use Hadoop and see its value. Corporate sponsorship is a catalyst for progress in open-source systems (see MySQL, Eclipse etc.). Hadoop got started as a component in the Nutch search engine, but it was Yahoo that invested resources, and helped make it a success story.

You can dig into Hadoop’s architecture and learn how it works (just like we did), or you could take advantage of the large ecosystem around it. The community reports bugs and helps people get started, there are books, and there is even paid support (look at Cloudera).

How does this relate to HBase? HBase is the Hadoop database. It has the best Hadoop integration. It uses HDFS for storage and MapReduce for distributed processing. Once you have a Hadoop cluster, you already have one half of an HBase cluster. It’s only natural that companies that are using Hadoop will be looking at HBase, if they aren’t already using it. And, following Hadoop’s model, they will invest resources and money, adding to HBase’s momentum. The companies that use HBase today sustain the core HBase development team. We too, are contributing back to both HBase and Hadoop. It’s only natural to invest in something that supports your business.


HBase is more complex than other systems (you need Hadoop, Zookeeper, cluster machines have multiple roles). We believe that for HBase, this is not accidental complexity and that the argument that “HBase is not a good choice because it is complex” is irrelevant. The advantages far outweigh the problems. Relying on decoupled components plays nice with the Unix philosophy: do one thing and do it well. Distributed storage is delegated to HDFS, so is distributed processing, cluster state goes to Zookeeper. All these systems are developed and tested separately, and are good at what they do. More than that, this allows you to scale your cluster on separate vectors. This is not optimal, but it allows for incremental investment in either spindles, CPU or RAM. You don’t have to add them all at the same time.

The HStack can be a pain to deploy. We took some time to understand the problem, and now we have Puppet recipes for everything. We can set up a cluster completely unattended. We’ll try to push all of this back to the community to make things easier for other users, so stay tuned.

Zookeeper, Hadoop, etc., let us implement transactions, simple queries and data processing. We want a system with such capabilities (these are requirements for large applications). Yes, you can drop some of them, but you can’t drop them all. We don’t want a tool that’s missing too much. We want the good parts from an RDBMS, like queries and transactions, while still having distributed processing and cheap scalability. We don’t drink the “NoSQL” kool-aid. We’re not running away from SQL; we’re running towards something that is built from scratch on the premise of scalability and high availability.

About Community and Leadership

This is the biased part of the article, and it should be, because it’s about our relationship with the HBase development team. Stack, Ryan and JD have always been very receptive. They always help with the issues we have, whether it’s a bug or a new feature that we need. There’s an open and democratic decision process when prioritizing work on HBase. The team is well balanced, and no single company drives HBase’s direction.

They are genuinely passionate about their work and strive to have it used by people. We attended one of their regular developer meetups, and it was eye-opening to see developers from different backgrounds and companies working together as a team. We think open-source projects benefit from good leadership, and Michael Stack has done a great job with HBase.

Another aspect that appeals to us is the maturity of the development team. They focus on long term benefits. For example, the current focus is to improve the architecture of HMaster and multi-datacenter replication. However, in light of recent performance benchmark reports they took the time to understand the situation, validated with the community that it’s OK to stick to the current plan and didn’t switch focus.

Maturity is also shown in the way that the team positions HBase in relation to other competing projects; they let facts speak rather than opinions. They don’t engage in holy wars and this, to us, seems the right way to build a healthy community.

In the end, we hope technologies won’t be dismissed based on superficial or biased perception, FUD, or tweets. We don’t like it when talks are based on assumptions without knowing ALL the details of a certain problem or technical choice, and this seems to be becoming a vicious trend in some circles. Hopefully, the reasons explained in this article can help you make your own informed assessments, and see what works for you.


  1. If anyone knows how to remotely “break” a network card or RAM stick, please let us know :) 
  2. Ever heard of Sidekick? http://www.sophos.com/blogs/gc/g/2009/10/13/catastrophic-data-loss-sidekick-users/
  3. By clients we mean our internal clients. Even though they have public data, our system is not publicly available. 
  4. “3” is also configurable: it’s the default replication factor in HDFS. 
  5. This behavior is only available on HDFS version 0.21.0, or 0.20.4 with patches. Take a look at HDFS-826
  6. By rather small, we mean so small that even with 4 replicas, the “cost” surpasses the benefit. 
  7. Depends on cluster load and configuration. It takes ~40 seconds for 800 regions out of 5300 when 1 out of 7 regionservers dies in our test. We used hbase.regions.percheckin at 100. We’ll do some thorough measuring as well and document it. 
  8. It’s usually a minute or two, but depends on how many regions need to be reassigned by HMaster. We’ll get back with some metrics on this too. 
  9. HBase 0.20 and 0.21 
  10. It’s a fact that if you have more data than RAM, uniform random read latency approaches the storage latency, at best. Our average response time is 7ms for a 10K RPM SATA disk. We haven’t tested SSDs yet, because they are not economically viable for us right now. 
  11. We tested with approximately 3B rows (about two orders of magnitude more data than available RAM, so data wasn’t served from the cache). See the cluster configuration in this article.
  12. Why do you need heavy write performance? See here for a description of Farmville’s architecture
  13. This behavior is only available on HDFS version 0.21.0, or 0.20.4 with patches. Take a look at HDFS-826
  14. We don’t want to push the analogy too far, but multi-threaded programming does not yet offer a simple and clear programming model: threading, actors, STM, etc. There is no clear winner, and they all make the application code complex. 
  15. See the Hadoop “Powered by”, as well as the Hadoop Summit proceedings, for more companies that are using Hadoop: Visa, IBM, Reuters, NY Times, etc. 
  16. Of course, TANSTAAFL, if you want to do heavy processing, you need beefy machines, and lots of them. 

[repost] Why we’re using HBase: Part 1 – Adobe


Our team builds infrastructure services for many clients across Adobe. We have services ranging from commenting and tagging to structured data storage and processing. We need to make sure that data is safe and always available; the services have to work fast regardless of the data volume.

This article is about how we got started using HBase and where we are now. More in-depth reasoning can be found in the second part of the article.

Lucky shot

If someone had asked me a couple of days ago why or how we chose HBase, I would have answered in a blink that it was about reliability, performance, costs, etc. (a bit brainwashed after answering “correctly” and “objectively” too many times). However, as the subject has become rather popular lately1, I reflected more deeply on the “how” and “why”.

The truth is that, in the beginning, we were attracted to working with bleeding edge technology and it was fun. It was a projection of the success we were hoping to have that motivated us. We all knew stories about Google File System, Bigtable, GMail and what made them possible. I guess we wanted a piece of that, and Hadoop and HBase were one logical step to reach that.

We didn’t even have a cluster when we started. I begged and bribed for hardware from teams that had extra cycles on their testing machines. We were going to use them just as SETI@Home does, well, sort of. Once we got 7 machines, we had a cluster2 running the Hadoop and HBase stack (HStack3). We even went on and refurbished some old broken machines to work as extra test agents, besides our own laptops.

Technology-driven decisions tend to fall over when assessed from a business perspective. I never thought about costs, data loss, etc.; we were somehow assuming these were all fine. If others ran it, why wouldn’t we be able to? We knew this architecture would enable scalability, but we didn’t challenge whether the implementation actually worked.

Scalability and performance “lured” us in. But in reality it’s the implementation that dictates costs, consistency, availability and performance. A good and scalable architecture is just a long term promise, unless it is backed up by the implementation. In our case the architecture choice paid off, as you’ll see.

Once we realized the potential of HBase through early experiments, we subjected it to a full analysis. It took a while to get an objective opinion, but after all the tests, we really knew we were on to something.

The 40M mark

We had already scaled MySQL, so denormalization, data partitioning and replication weren’t all that new to us. When, in mid-2008, one of our clients asked us to provide a service that could handle 40M records, with real-time access, aggregation and all that, we thought we had an answer. This was our first step towards doing “big data”.

There were no benchmarking reports4 then, no “NoSQL” moniker, therefore no hype :). We had to do our own objective assessment. The list was (luckily) short: HBase, Hypertable and Cassandra were on the table. MySQL was there as a last resort.

We abstracted the implementation details, and made a stub (so that the clients could start developing), and we started testing each technology stack.

Cassandra was out first. It had just come out, was barely usable, lacked any decent resources or an active mailing list, and could keep only one table per instance. This has changed a lot over time, but we had a deadline then.

Funny enough, HBase was the second one out :).

When we started pushing 40 million records, HBase5 squeaked and cracked. After 20M inserts it failed so badly it wouldn’t respond or restart; it mangled the data completely and we had to start over. It was performing badly and seemed to lose data.

I literally dreamt logs for a week, trying to identify the issues. It looked as if we would discard HBase, but I insisted that we would be able to switch later even if we went ahead with Hypertable. The team agreed, even though HBase didn’t look like a viable choice compared to Hypertable, which handled the data rather well.

The HBase community turned out to be great: they jumped in and helped us, and upgrading to a new HBase version fixed our problems. Hypertable6, on the other hand, still seemed to perform better.

Plan for failure

When testing failover scenarios, HBase started to gain ground, handling node failures gracefully and consistently. Hypertable had problems bringing data back up after node failures and required manual intervention. We were, once more, left with a single option.

But there were more questions to be answered, concerns that we never had with MySQL:

What’s the guarantee that every bit you put in comes back in the same form no matter what?

We had to be able to detect corruption and fix it. As we had an encryption expert on the team (who had authored a watermarking attack), he designed a model that would check consistency on-the-fly with CRC checksums and allow versioning. The Thrift-serialized data was wrapped in another layer that contained both the inner data types and the HBase location (row, family and qualifier). (He’s a bit paranoid sometimes, but that tends to come in handy when disaster strikes :). Pretty much what Avro does.
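
A minimal sketch of that wrapping scheme, assuming a simple layout of checksum + location + payload (the class name and the exact byte layout here are our own illustration, not the actual production format):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical sketch of an on-the-fly consistency check: wrap the
// serialized payload together with its HBase location (row, family,
// qualifier) and a CRC32 checksum, so corruption or a misplaced cell
// is detected when the value is read back.
public class CheckedCell {

    public static byte[] wrap(String row, String family, String qualifier, byte[] payload) {
        byte[] location = (row + "/" + family + "/" + qualifier).getBytes(StandardCharsets.UTF_8);
        CRC32 crc = new CRC32();
        crc.update(location);
        crc.update(payload);
        ByteBuffer buf = ByteBuffer.allocate(8 + 4 + location.length + payload.length);
        buf.putLong(crc.getValue());   // checksum over location + payload
        buf.putInt(location.length);
        buf.put(location);
        buf.put(payload);
        return buf.array();
    }

    // Returns the payload, or throws if the checksum fails or the cell
    // wasn't read from the location we expected.
    public static byte[] unwrap(String row, String family, String qualifier, byte[] cell) {
        ByteBuffer buf = ByteBuffer.wrap(cell);
        long stored = buf.getLong();
        int locLen = buf.getInt();
        byte[] location = new byte[locLen];
        buf.get(location);
        byte[] payload = new byte[buf.remaining()];
        buf.get(payload);

        String expected = row + "/" + family + "/" + qualifier;
        if (!expected.equals(new String(location, StandardCharsets.UTF_8)))
            throw new IllegalStateException("cell read from wrong location");
        CRC32 crc = new CRC32();
        crc.update(location);
        crc.update(payload);
        if (crc.getValue() != stored)
            throw new IllegalStateException("corrupt cell");
        return payload;
    }
}
```

Checksumming the location together with the payload means a bit flip and a cell stored under the wrong row/family/qualifier both fail the same read-time check.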

Another scenario we had to cover was a total failure of the Hadoop Distributed File System (HDFS) that HBase relies on.

We adapted an outdated HBase export tool to ensure data consistency during and after backup. We kept backups in 3 places: locally, HDFS and another distributed file system. We also had it prepared in MySQL format so we could switch to a MySQL cluster in case of disaster. We had scripts that would take the latest backup and bring up an alternative storage cluster (HBase or MySQL).

Things went pretty smoothly. We created MapReduce jobs to compute recommendations out of the data that was stored, we had an automated backup, and a mechanism for disaster recovery.

In October 2008 our system went live on time.

Disasters WILL happen

On the 3rd of December 2008, around midnight7, sanity alerts started pouring in: the service was running in degraded mode. Our HBase cluster would write data but couldn’t answer correctly to reads. Following the procedure, I made another backup and restored it on a MySQL cluster, then enabled backups for the MySQL cluster itself. We were paranoid enough to have a backup procedure for the MySQL “backup” cluster as well :)

We were up and running about half an hour later. MySQL had saved the day, and we congratulated ourselves. Except that our thorough, tested backup plan had a glitch: master-master replication was not set up for the new tables. We soon had two different data sets on the two servers. We had to stop replication, switch to a single node and fix the consistency issue. Master-master replication is pretty much a hack and needs proper care; otherwise, it’s pretty easy to screw up your data.

After a thorough investigation8, a postmortem and a couple of patches that brought both HBase and HDFS to the latest released version, we switched back to HStack9 on the 5th of December. We had no data loss and our clients only experienced a short interruption. Our client team congratulated us for the fast response.

Most importantly, this was our first reality check and a new lesson learnt. The system still runs today, and we have had no problems with it since. After just a few months, our biggest client switched direction and never got to the 40 million records.

The system never reached its planned capacity. Even though we took new clients on board and implemented new services on top of it, we weren’t yet the “stars” we’d imagined ourselves to be.

In reality, all we had would have been easy to handle with a MySQL cluster and just a little operational overhead.

The Billion Mark

We decided to switch focus at the beginning of 2009. We were going to provide a generic, real-time, structured data storage and processing system that could handle any data volume. By this time we had caught the attention of bigger potential clients, and the requirements changed a bit. 40 million became 1 billion, with access times under 50ms and serious processing power. All this with no downtime and definitely no data loss.

This time, we were going to do it right:

We wrote down all the failure scenarios10 that we could think of: bad disk, bad memory, packet loss, network card failure, machine and rack power failure, RAID controller failure, etc. Nothing was “sacred”. At one point in sprint demos, we would ask for two random numbers (two machines in the cluster) and unplug the power cable, unplug the network cable, or just randomly pull hard drives while the system was running.

We used DRBD and Heartbeat to have HA for the Hadoop NameNode, because this was the last Single Point of Failure (SPOF) in our system.

We initially failed to reach 1B records, as we filled up all the disks at less than 500M records. However, the performance and resilience to failures under rough conditions helped us “bootstrap” for something bigger. So, we got a new cluster11 installed that could handle the necessary capacity.

We also looked at the operational overhead; we always try to automate as much as possible. Our latest deployment system takes any number of bare-metal machines (via the IPMI interface), deploys the OS, partitions everything, and then deploys and configures Hadoop and HBase along with our own systems, completely unattended. Everything is set up using a combination of Kickstart scripts and Puppet for service deployment. Even now, deploying a new cluster is basically a git push away.

We had a 3B-record table on which we ran all our benchmarks: cold start (no memory caches), reads, writes, and combined tests in different proportions. Random reads under 15 ms, huge throughput, all the cool stuff you’d imagine.
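
As an illustration of how such read benchmarks can be measured (a hedged sketch, not our actual harness), the helper below times a uniform random-read workload and reports latency percentiles; in the real tests the operation would be an HBase Get against the 3B-row table:

```java
import java.util.Arrays;
import java.util.Random;
import java.util.function.Consumer;

// Illustrative benchmark helper: run `ops` random operations over a key
// space and collect per-operation latency, so percentiles (p50, p99, ...)
// can be reported rather than just an average.
public class LatencyBench {

    // Returns the observed latencies in nanoseconds, sorted ascending.
    public static long[] measure(int ops, long keySpace, Consumer<Long> readOp) {
        Random rnd = new Random(42);  // fixed seed for repeatability
        long[] latencies = new long[ops];
        for (int i = 0; i < ops; i++) {
            long key = Math.floorMod(rnd.nextLong(), keySpace); // uniform random key
            long start = System.nanoTime();
            readOp.accept(key);       // e.g. an HBase Get in a real run
            latencies[i] = System.nanoTime() - start;
        }
        Arrays.sort(latencies);
        return latencies;
    }

    // p in [0, 100): e.g. percentile(sorted, 99) for the 99th percentile.
    public static long percentile(long[] sorted, double p) {
        int idx = (int) Math.min(sorted.length - 1, Math.floor(sorted.length * p / 100.0));
        return sorted[idx];
    }
}
```

Uniform random keys matter here: they defeat the block cache, which is what makes the "more data than RAM" footnote above the honest way to benchmark.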

We started contributing to HBase, HDFS and MapReduce to support all our failover and performance scenarios and make appropriate fixes when necessary.

Our system is currently in “beta” inside the organization. We’re still testing it, but it already supports distributed data import, random access and distributed data processing; we have the APIs and web interfaces, and it runs fast.

This is a historical account of how we started working with Hadoop and HBase. The second part of this article shows more practical and objective reasons why we stick with this technology stack.


  1. Especially in the “NoSQL” community. Check out a Google trend and a social media trend
  2. We’ve been watching Hadoop since 2007, but it was this article that triggered me to move from playing to actually deploying it on a cluster. On the 16th of April 2008 our first Hadoop/HBase cluster (called HStack) was operational.
  3. Our umbrella term for Hadoop, HBase, Zookeeper and friends.
  4. Like this. Mind you, we have sensible differences to these results, stay tuned :).
  5. Our first real tests were running against HBase-0.2.0. HBase later changed its versioning scheme to mirror Hadoop’s versions, so HBase-0.3 became HBase-0.18.
  6. Hypertable-0.9.27
  7. 11:51PM (GMT+2), to be more exact :)
  8. The problem had started at least a month before, when a bug silently disabled .META. compactions. So, the .META. table had a lot of StoreFiles and, from a point onward HDFS started to throttle open file handles. A full cluster restart would have temporarily fixed it and a proper log monitoring would have alerted us long before it got too late.
  9. Our umbrella term for Hadoop, HBase, Zookeeper and friends.
  10. Like this. This kind of research comes in very handy, when you try to see how expensive a failure is.
  11. 7 machines; each with dual quad-core hyper-threaded CPUs, 32GB RAM, 24 10K RPM SATA disks, battery-backed RAID controllers, and IPMI. Commodity hardware does NOT mean crappy hardware. When you want performance, you need lots of spindles and memory.

[repost] 1 Billion Reasons Why Adobe Chose HBase


Cosmin Lehene wrote two excellent articles on Adobe’s experiences with HBase: Why we’re using HBase: Part 1 and Why we’re using HBase: Part 2. Adobe needed a generic, real-time, structured data storage and processing system that could handle any data volume, with access times under 50ms, with no downtime and no data loss. The article goes into great detail about their experiences with HBase and their evaluation process, providing a “well reasoned impartial use case from a commercial user”. It talks about failure handling, availability, write performance, read performance, random reads, sequential scans, and consistency.

One of the knocks against HBase has been its complexity, as it has many parts that need installation and configuration. All is not lost, according to the Adobe team:

HBase is more complex than other systems (you need Hadoop, Zookeeper, cluster machines have multiple roles). We believe that for HBase, this is not accidental complexity and that the argument that “HBase is not a good choice because it is complex” is irrelevant. The advantages far outweigh the problems. Relying on decoupled components plays nice with the Unix philosophy: do one thing and do it well. Distributed storage is delegated to HDFS, so is distributed processing, cluster state goes to Zookeeper. All these systems are developed and tested separately, and are good at what they do. More than that, this allows you to scale your cluster on separate vectors. This is not optimal, but it allows for incremental investment in either spindles, CPU or RAM. You don’t have to add them all at the same time.

Highly recommended, especially if you need some sort of balance to the recent gush of Cassandra articles.

About Desktop RIA comparison: Dekoh Versus Adobe Apollo


Dekoh consists of 3 components:

  1. Dekoh Desktop: Desktop RIA platform.
  2. Dekoh Applications: Dekoh ships applications that run on Dekoh Desktop.
  3. Dekoh Network: Enables secure sharing of Dekoh Desktop applications and content on the web. Viewers don’t need to install Dekoh software.

Adobe Apollo and Dekoh Desktop are both RIA platforms on the desktop. Apollo does not have an equivalent of Dekoh Applications and Dekoh Network. Hence, I compare Dekoh Desktop with Apollo in this article.

Installation and OS support

Dekoh and Apollo are both cross-operating-system runtimes that help in developing and running RIAs on the desktop.

Feature | Dekoh Desktop | Apollo
Cross-OS achieved through | Java web server | Flash/Flex
Installation size | 5MB | 5-9MB
Single-click installation from the web | Yes | Don't know
Automatic updates | Yes. Versioning API available to all applications | Yes. Don't know if API is available for applications
Startup | Desktop icon, system tray, Windows startup | Desktop icon, system tray, Windows startup
OS supported in version 1.0 release | Windows, Mac, Linux | Windows, Mac
Browsers supported | IE (6 & 7), Firefox, Safari | None. Home-grown rendering engine based on WebKit

User Interface

Applications can leverage these technologies for rendering UI.

Feature | Dekoh Desktop | Apollo
Built using (any combination of) | HTML, DHTML, Javascript, Flash, AJAX, Java applets | HTML, DHTML, Javascript, Flash, AJAX
Reusable widgets | Yes | No
Drag-and-drop support | Yes, inside the browser | Yes
Special effects | Some effects available through JS/AJAX libraries | Window transparency, rotation and many more
Browser plugins | All browser plugins work | None
Browser toolbars | Yes | No

Programming API

Applications can use these technologies for writing application logic.

Feature | Dekoh Desktop | Apollo
Programming language | JSP, Servlet, Java | Flex
Bundled database | Yes | No
Other database support | Through JDBC; object persistence support through JPA | None
Web services access | Yes | Yes
Inter-application communication | Java or HTTP | Inter-Application Communication (IAC) protocol
OS services, like filesystem access | Through standard Java packages and APIs | Exposed through Apollo API
Invoking native DLLs, libraries | JNI, Java-COM bridge | None
Secure sandbox | Warns users before installing unsigned applications | Don't know (not decided yet, as per a product manager)


Feature | Dekoh Desktop | Apollo
Cost | Free (as in beer) | Free (as in beer)
License | Open source | Proprietary
RSS support | Yes | No
Web 2.0 features like sharing, tagging, commenting | Yes | No
Share applications or content from the desktop with a personal friend network | Yes | No

All data regarding Apollo has been gathered through Apollo Developer FAQ and publicly available presentations/videos.