Tag Archives: Apache

[Summary] Hive Performance Tuning

Designing for Performance Using Hadoop Hive

http://kb.tableau.com/articles/knowledgebase/hadoop-hive-performance?lang=zh-cn

Several techniques can improve the performance of visualizations and dashboards built on data stored in a Hadoop cluster. Although Hadoop is a batch-oriented system, the recommendations below can reduce latency through workload tuning and optimization hints for the Tableau Data Engine.

Improving Connection Performance

Custom SQL to Limit the Size of the Data Set

Custom SQL allows complex SQL expressions to be used as the basis for a connection in Tableau. By using a LIMIT clause in the custom SQL, you can reduce the size of the data set to speed up the work of exploring a new data set and building views. The LIMIT clause can be removed later to support live queries against the entire data set.

Getting started with custom SQL to limit the data set size is easy. If your connection is a single-table or multi-table connection, you can switch it to a custom SQL connection and let the connection dialog automatically populate the custom SQL expression. On the last line of the custom SQL, add "LIMIT 10000" so that only the first 10,000 records are used.
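
As a rough sketch, the auto-generated custom SQL with a LIMIT clause added on the last line might look like the following; the table and column names (web_logs, event_time, source, bytes) are hypothetical, and the LIMIT can be dropped later to query the full data set.

  -- Hypothetical custom SQL: web_logs and its columns are assumed names.
  -- The LIMIT keeps exploration fast; remove it for live queries on all rows.
  SELECT `web_logs`.`event_time` AS `event_time`,
         `web_logs`.`source`     AS `source`,
         `web_logs`.`bytes`      AS `bytes`
  FROM `web_logs`
  LIMIT 10000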

Extracts

The Tableau Data Engine is a powerful accelerator when working with large volumes of data and supports ad-hoc analysis with low latency. Although the Tableau Data Engine is not built for the same scale as Hadoop, it can handle wide data sets with many fields and hundreds of millions of rows.

By creating an extract in Tableau, you can speed up data analysis by condensing enormous amounts of data into a much smaller data set that contains the most relevant characteristics. When creating an extract, take advantage of the following options in the Extract Data dialog:

  • Hide unused fields: Ignores fields that are hidden in the Tableau Data window, keeping the extract compact.
  • Aggregate for visible dimensions: Creates an extract pre-aggregated to a coarser-grained view of the data. Although Hadoop is well suited to storing every fine-grained data point, a broader view of the data often yields much the same insight at far lower computational cost.
  • Roll up dates: Hadoop date/time data is a specific example of fine-grained data that is more useful when rolled up to a coarser timeline, for example tracking events per hour rather than per millisecond (a HiveQL sketch of such a roll-up follows this list).
  • Define filters: Create a filter to keep only the data of interest, for example when you work with archival data but only care about recent records.
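
The same roll-up and filtering ideas can also be expressed directly against Hive. A minimal sketch, assuming a hypothetical events table whose event_time column is a string in 'yyyy-MM-dd HH:mm:ss.SSS' form:

  -- Hypothetical table and column names, shown only to illustrate rolling up
  -- millisecond-level events to hourly counts and keeping only recent data.
  SELECT substr(event_time, 1, 13) AS event_hour,   -- 'yyyy-MM-dd HH'
         count(*)                  AS events
  FROM events
  WHERE event_time >= '2013-01-01'
  GROUP BY substr(event_time, 1, 13)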

Advanced Performance Techniques

The following performance techniques require a deeper understanding of Hive. Refer to the Improving Performance as an Administrator section of this article for details that administrators should consider when creating tables in Hive.

While offering a degree of support for traditional database techniques such as query optimizers and indexes, Hive also provides some unique ways to improve query performance.

Partition Fields as Filters

Tables in Hive can define partition fields, which divide the records on disk into partitions based on the value of the field. When a query contains a filter on a partition field, Hive can quickly isolate the subset of data blocks required to satisfy the query. By creating filters on one or more partition fields in Tableau, you can substantially reduce query execution time.

A known limitation of this Hive query optimization is that the partition filter must match the data in the partition field exactly. For example, if a string field contains dates, you cannot filter on YEAR([date_field])=2011. Instead, express the filter in terms of the raw data values, for example [date_field] >= '2011-01-01'. More generally, you cannot take advantage of partition filtering with calculations or expressions derived from the field; you must filter the field directly against literal values.
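
As a sketch, assuming a hypothetical page_views table partitioned by a string column named ds, only the second query below allows Hive to prune partitions:

  -- Partition pruning does NOT apply: the filter is an expression over ds.
  SELECT count(*) FROM page_views WHERE year(ds) = 2011;

  -- Partition pruning applies: ds is compared directly to literal values.
  SELECT count(*) FROM page_views
  WHERE ds >= '2011-01-01' AND ds < '2012-01-01';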

Clustered Fields (Bucketed Fields)

Clustered fields, sometimes known as bucketed fields, determine how the table data is separated on disk. One or more fields are defined as the clustering fields for a table, and the combined fingerprint of those fields ensures that all rows with the same clustered-field content sit close together within the data blocks.

This fingerprint is called a hash, and it improves query performance in two ways. First, computing aggregations across the clustered fields can take advantage of an early phase of the map/reduce pipeline known as the combiner, which reduces network traffic by sending less data to each reducer. This is most effective when you use the clustered fields in a GROUP BY clause; you will find these fields associated with the discrete (blue) fields in Tableau, typically dimensions.

The other way clustered-field hashes improve performance is with joins. A join optimization known as a hash join allows Hadoop to rapidly merge two data sets based on the precomputed hash values, provided the join keys are also clustered fields. Because clustered fields guarantee that the data blocks are organized by hash value, a hash join becomes far more efficient in terms of disk I/O and network bandwidth, since it can operate on large, co-located blocks of data.
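
A minimal sketch of how such a table might be declared and used; the table and column names are hypothetical and 32 buckets is an arbitrary choice:

  -- Hypothetical bucketed table: rows are hashed on customer_id into 32 buckets.
  CREATE TABLE orders_bucketed (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DOUBLE
  )
  CLUSTERED BY (customer_id) INTO 32 BUCKETS;

  -- Aggregating on the clustered field lets the combiner do early partial work;
  -- if both sides of a join are bucketed on the join key, the bucket map join
  -- settings shown in the Initial SQL section below can also be enabled.
  SELECT customer_id, sum(amount) AS total_amount
  FROM orders_bucketed
  GROUP BY customer_id;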

Initial SQL

Initial SQL opens up the possibility of setting configuration parameters and performing work as soon as the connection is established. This section discusses in more detail how to use Initial SQL for advanced performance tuning. It cannot cover every case; there are many tuning options whose usefulness varies with the size of the cluster and with the type and size of the data.

  • Increasing parallelism

The first tuning example uses Initial SQL to force a higher degree of parallelism for the jobs generated for Tableau analysis. By default, parallelism is determined by the size of the data set and the default block size of 64 MB. A data set containing only 128 MB of data will start only two concurrent map tasks for any query against that data. For data sets that call for computation-intensive analytical tasks, you can force more parallelism by lowering the threshold on how much data constitutes a single unit of work. The following setting uses a split size of 1 MB, potentially increasing concurrent throughput by a factor of 64:

  • set mapred.max.split.size=1000000;

 

  • Optimizing join performance

The next example extends the earlier discussion of using clustered (bucketed) fields to improve join performance. In many versions of Hive this optimization is off by default. Enable it with the following settings.

  • set hive.optimize.bucketmapjoin=true;
  • set hive.optimize.bucketmapjoin.sortedmerge=true;

Note: The second setting takes advantage of clustered fields that are also sorted.

  • Tuning the configuration for unevenly distributed data

This last example shows how a configuration setting can depend on the shape of the data. When working with data that has a highly uneven distribution, such as web traffic broken down by traffic source, the nature of map/reduce can produce dramatic data skew in which a small number of compute nodes must handle the bulk of the computation. The following setting tells Hive that the data is likely to be skewed and that it should plan the map/reduce jobs differently. For data that is not heavily skewed, this setting may reduce performance.

  • set hive.groupby.skewindata=true;

Improving Performance as an Administrator

Hive does not change the fundamentally batch-oriented nature of Hadoop. However, there are some techniques an administrator can apply through Hive to make the cluster more efficient.

Storage File Formats

Hive offers the flexibility to work with data files as they are. An external table in Hive can reference data in the distributed file system that lives outside the Hive environment. The external table describes how the data is compressed and how it should be parsed, and each Hive query then decompresses and parses the data on the fly. This is a trade-off between flexibility and efficiency.
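
A minimal sketch of an external table over raw delimited files already sitting in HDFS; the path, delimiter, and columns are hypothetical:

  -- Hypothetical external table over tab-delimited log files in HDFS.
  -- Dropping the table removes only the metadata, not the underlying files.
  CREATE EXTERNAL TABLE raw_logs (
    event_time STRING,
    source     STRING,
    bytes      BIGINT
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE
  LOCATION '/data/raw_logs';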

Hive tables can also capture a data set in a storage format that is more efficient for frequent or complex analytical tasks. The default storage format is the simple delimited text file, but many other formats, such as sequence files, are available. For use with analytical tools such as Tableau, you may prefer the Record Columnar File format (RCFile), a hybrid row-columnar format that supports efficient analysis when only a subset of the data is needed.
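
As a sketch, capturing a data set into an RCFile-backed table might look like the following; the table names are hypothetical and reuse the external table sketched above:

  -- Hypothetical managed table stored as RCFile for column-oriented access.
  CREATE TABLE logs_rc (
    event_time STRING,
    source     STRING,
    bytes      BIGINT
  )
  STORED AS RCFILE;

  -- Populate it from the raw external table defined earlier.
  INSERT OVERWRITE TABLE logs_rc
  SELECT event_time, source, bytes FROM raw_logs;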

Several data formats offer native support for compression; others are compatible with specific compression formats such as BZ2. You can also configure Hive to compress the intermediate data shared from one compute node to the next across the network. Each type of compression is a trade-off between CPU time and disk or network I/O. There is plenty of guidance on the subject, but because clusters, data, and analytical use cases vary so much, there are no firm rules about which compression scheme is appropriate.
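
As one illustration, compression can be switched on with settings like the following; the property names are the classic Hive/Hadoop 1.x ones and the codec choice is an assumption to adjust for your cluster:

  -- Assumed settings using classic property names: compress data Hive passes
  -- between the stages of a query, and compress final query output.
  set hive.exec.compress.intermediate=true;
  set hive.exec.compress.output=true;
  -- Codec selection is cluster-specific; GzipCodec is just one possibility.
  set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;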

Partitioning

You can organize a Hive table into separate files in the distributed file system, each containing many blocks of data, in order to control data proximity for efficient access. You do this by defining one or more fields as partition fields; each unique combination of field values then produces a separate file in HDFS. A date field is commonly used for partitioning so that records with the same date stay together. When a Hive query uses a WHERE clause to filter on a partition field, the filter effectively describes which data files are relevant. By loading only those files from the file system, the query can run much faster than a comparable query that filters on a non-partitioned field.
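
A sketch of a partitioned table and a load into a single partition; all names, the staging table, and the date value are hypothetical:

  -- Hypothetical table partitioned by a date string 'ds'; each value of ds
  -- becomes its own set of files in HDFS.
  CREATE TABLE page_views (
    user_id  BIGINT,
    url      STRING,
    referrer STRING
  )
  PARTITIONED BY (ds STRING);

  -- Load one day's data from an assumed staging table into its own partition.
  INSERT OVERWRITE TABLE page_views PARTITION (ds = '2013-09-01')
  SELECT user_id, url, referrer
  FROM page_views_staging
  WHERE to_date(event_time) = '2013-09-01';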

Bucketing

Similar to partitioning, bucketing fields (that is, clustered fields) define how the table data is organized into separate files in the distributed file system. A hash is computed from the data values of one or more bucketing fields, and the data is then divided into files according to which hash bucket it falls into, out of the fixed total number of buckets defined for the Hive table.

In a map/reduce operation, the map phase organizes data by hash value. Because the hash is precomputed and the data is already organized by hash value, bucketing can dramatically speed up the map phase and reduce the data shuffle between the map and reduce phases, a shuffle that is otherwise heavy on network I/O. In SQL terms, the net result is faster GROUP BY and join operations.
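
A sketch of creating and populating a bucketed, sorted table; hive.enforce.bucketing is the classic pre-Hive-2 switch that makes the INSERT honor the declared bucket count, and all table names are hypothetical:

  -- Hypothetical bucketed and sorted table, suitable for bucket map joins.
  CREATE TABLE customers_bucketed (
    customer_id BIGINT,
    name        STRING
  )
  CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 32 BUCKETS;

  -- Ask Hive to honor the bucket definition while writing (pre-Hive-2 setting).
  set hive.enforce.bucketing=true;

  INSERT OVERWRITE TABLE customers_bucketed
  SELECT customer_id, name FROM customers_staging;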

 

[repost ]Data Processing API in Apache Tez

original:http://hortonworks.com/blog/expressing-data-processing-in-apache-tez/

This post is the second in our series on the motivations, architecture and performance gains of Apache Tez for data processing in Hadoop.

Overview

Apache Tez models data processing as a dataflow graph, with the vertices in the graph representing processing of data and edges representing movement of data between the processing steps. Thus the user logic that analyses and modifies the data sits in the vertices. Edges determine the consumer of the data, how the data is transferred and the dependency between the producer and consumer vertices. This model concisely captures the logical definition of the computation. When the Tez job executes on the cluster, it expands this logical graph into a physical graph by adding parallelism at the vertices to scale to the data size being processed. Multiple tasks are created per logical vertex to perform the computation in parallel.

DAG Definition API

More technically, the data processing is expressed in the form of a directed acyclic graph (DAG). Processing starts at the root vertices of the DAG and continues down the directed edges until it reaches the leaf vertices. When all the vertices in the DAG have completed, the data processing job is done. The graph does not have cycles because the fault tolerance mechanism used by Tez is re-execution of failed tasks. When the input to a task is lost, the producer task of that input is re-executed, so Tez needs to be able to walk up the graph edges to locate a non-failed task from which to restart the computation. Cycles in the graph can make this walk difficult to perform. In some cases, cycles may be handled by unrolling them to create a DAG.

Tez defines a simple Java API to express a DAG of data processing. The API has three components:

  • DAG: this defines the overall job. The user creates a DAG object for each data processing job.
  • Vertex: this defines the user logic and the resources and environment needed to execute it. The user creates a Vertex object for each step in the job and adds it to the DAG.
  • Edge: this defines the connection between producer and consumer vertices. The user creates an Edge object and connects the producer and consumer vertices using it.

The diagram shows a dataflow graph and its definition using the DAG API (simplified). The job consists of 2 vertices performing a “Map” operation on 2 datasets. Their output is consumed by 2 vertices that do a “Reduce” operation. Their output is brought together in the last vertex that does a “Join” operation.

Tez handles expanding this logical graph at runtime to perform the operations in parallel using multiple tasks. The diagram shows a runtime expansion in which the first M-R pair has a parallelism of 2 while the second has a parallelism of 3. Both branches of computation merge in the Join operation, which has a parallelism of 2. Edge properties are at the heart of this runtime activity.

Edge Properties

The following edge properties enable Tez to instantiate the tasks, configure their inputs and outputs, schedule them appropriately and help route the data between the tasks. The parallelism for each vertex is determined based on user guidance, data size and resources.

  • Data movement. Defines routing of data between tasks
    • One-To-One: Data from the i-th producer task routes to the i-th consumer task.
    • Broadcast: Data from a producer task routes to all consumer tasks.
    • Scatter-Gather: Producer tasks scatter data into shards and consumer tasks gather the shards. The i-th shard from all producer tasks routes to the i-th consumer task.
  • Scheduling. Defines when a consumer task is scheduled
    • Sequential: Consumer task may be scheduled after a producer task completes.
    • Concurrent: Consumer task must be co-scheduled with a producer task.
  • Data source. Defines the lifetime/reliability of a task output
    • Persisted: Output will be available after the task exits. Output may be lost later on.
    • Persisted-Reliable: Output is reliably stored and will always be available.
    • Ephemeral: Output is available only while the producer task is running.

Some real-life use cases help clarify the edge properties. MapReduce would be expressed with the scatter-gather, sequential and persisted edge properties. Map tasks scatter partitions and reduce tasks gather them. Reduce tasks are scheduled after the map tasks complete, and the map task outputs are written to local disk and hence available after the map tasks have completed. When a vertex checkpoints its output into HDFS, its output edge has a persisted-reliable property. If a producer vertex is streaming data directly to a consumer vertex, the edge between them has ephemeral and concurrent properties. A broadcast property is used on a sampler vertex that produces a global histogram of data ranges for range partitioning.

We hope that the Tez dataflow definition API will be able to express a broad spectrum of data processing topologies and enable higher level languages to elegantly transform their queries into Tez jobs.

Find more about Apache Tez here.

[repost ]Apache Tez: Dynamic Graph Reconfiguration

original:http://hortonworks.com/blog/apache-tez-dynamic-graph-reconfiguration/

This post is the fifth in our series on the motivations, architecture and performance gains of Apache Tez for data processing in Hadoop.

Case Study: Automatic Reduce Parallelism

Motivation

Distributed data processing is dynamic by nature, and it is extremely difficult to statically determine optimal concurrency and data movement methods a priori. More information is available during runtime, like data samples and sizes, which may help optimize the execution plan further. We also recognize that Tez by itself cannot always have the smarts to perform these dynamic optimizations. The design of Tez includes support for pluggable vertex management modules to collect relevant information from tasks and change the dataflow graph at runtime to optimize for performance and resource usage. The diagram shows how we can determine an appropriate number of reducers in a MapReduce-like job by observing the actual data output produced and the desired load per reduce task.

Performance & Efficiency via Dynamic Graph Reconfiguration

Tez envisions running computation by the most resource-efficient and high-performance means possible given the runtime conditions in the cluster and the results of the previous steps of the computation. This functionality is constructed using a couple of basic building blocks:

  • Pluggable Vertex Management Modules: The control flow architecture of Tez incorporates a per-vertex pluggable module for user logic that deeply understands the data and computation. The vertex state machine invokes this user module at significant transitions of the state machine such as vertex start, source task completion etc. At these points the user logic can examine the runtime state and provide hints to the main Tez execution engine on attributes like vertex task parallelism.
  • Event Flow Architecture: Tez defines a set of events by which different components like vertices, tasks etc. can pass information to each other. These events are routed from source to destination components based on a well-defined routing logic in the Tez control plane. One such event is the VertexManager event that can be used to send any kind of user-defined payload to the VertexManager of a given vertex.

Case Study: Reduce task parallelism and Reduce Slow-start

Determining the correct number of reduce tasks has been a long standing issue for Map Reduce jobs. The output produced by the map tasks is not known a priori and thus determining that number before job execution is hard. This becomes even more difficult when there are several stages of computation and the reduce parallelism needs to be determined for each stage. We take that as a case study to demonstrate the graph reconfiguration capabilities of Tez.

Reduce Task Parallelism: Tez has a ShuffleVertexManager that understands the semantics of hash-based partitioning performed over the shuffle transport layer used in MapReduce. Tez defines a VertexManager event that can be used to send an arbitrary user payload to the vertex manager of a given vertex. The partitioning tasks (say the Map tasks) use this event to send statistics, such as the size of the output partitions produced, to the ShuffleVertexManager for the reduce vertex. The manager receives these events and tries to model the final output statistics that would be produced by all the tasks. It can then advise the vertex state machine of the Reduce vertex to decrease the parallelism of the vertex if needed. The idea is to over-partition first and then determine the correct number at runtime. The vertex controller can cancel the extra tasks and proceed as usual.

Reduce Slow-start/Pre-launch: Slow-start is a MapReduce feature wherein the reduce tasks are launched before all the map tasks complete. The hypothesis is that reduce tasks can start fetching the completed map outputs while the remaining map tasks complete. Determining when to pre-launch the reduce tasks is tricky because it depends on the output data produced by the map tasks. It would be inefficient to launch reduce tasks so early that they finish fetching the data and sit idle while the remaining maps are still running. In Tez, the slow-start logic is embedded in the ShuffleVertexManager. The vertex state controller informs the manager whenever a source task (here the Map task) completes. The manager uses this information to determine when to pre-launch the reduce tasks and how many to pre-launch. It then advises the vertex controller.

It's easy to see how the above can be extended to determine the correct parallelism for range-partitioning scenarios. The data samples could be sent via VertexManager events to the vertex manager, which can create the key-range histogram and determine the correct number of partitions. It can then assign the appropriate key ranges to each partition. Thus, in Tez, this operation can be achieved without the overhead of a separate sampling job.

Learn more about Apache Tez here. 

[repost ]Re-Using Containers in Apache Tez

original:http://hortonworks.com/blog/re-using-containers-in-apache-tez/

This post is the sixth in our series on the motivations, architecture and performance gains of Apache Tez for data processing in Hadoop.

Motivation

Tez follows the traditional Hadoop model of dividing a job into individual tasks, all of which are run as processes via YARN, on the users’ behalf – for isolation, among other reasons. This model comes with inherent costs – some of which are listed below.

  • Process startup and initialization cost, especially when running a Java process is fairly high. For short running tasks, this initialization cost ends up being a significant fraction of the actual task runtime. Re-using containers can significantly reduce this cost.
  • Stragglers have typically been another problem for jobs – where a job’s runtime is limited by the slowest running task. With reduced static costs per task, it becomes possible to run more tasks, each with a smaller work-unit. This reduces the runtime of stragglers (smaller work-units), while allowing faster tasks to process additional work-units that can overlap with the stragglers.
  • Re-using containers has the additional advantage of not needing to allocate each container via the YARN ResourceManager (RM).

Other than helping solve some of these existing concerns, re-using containers provides additional opportunities for optimization where data can be shared between tasks.

Considerations for Re-Using Containers

Compatibility of containers

Each vertex in Tez specifies parameters, which are used when launching containers. These include the requested resources (memory, CPU etc), YARN LocalResources, the environment, and the command line options for tasks belonging to this Vertex. When a container is first launched, it is launched for a specific task and uses the parameters specified for the task (or vertex) – this then becomes the container’s signature. An already running container is considered to be compatible for another task when the running container’s signature is a superset of what the task requires.

Scheduling

Initially, when no containers are available, the Tez AM will request containers from the RM with location information specified, and rely on YARN’s scheduler for locality-aware assignments. However, for containers which are being considered for re-use, the scheduling smarts offered by YARN are no longer available.

The Tez scheduler works with several parameters to take decisions on task assignments –  task-locality requirements, compatibility of containers as described above, total available resources on the cluster, and the priority of pending task requests.

When a task completes and the container running the task becomes available for re-use, a task may not be assigned to it immediately, since there may be no pending task for which the data is local to the container’s node. The Tez scheduler first attempts to find a task for which the data would be local to the container. If no such task exists, the scheduler holds on to the container for a specific time before allocating any pending tasks to it. The expectation here is that more tasks will complete, which gives additional opportunities for scheduling tasks on nodes that are close to their data. Going forward, non-local containers may be used in a speculative manner.

Priority of pending tasks (across different vertices), compatibility and cluster resources are considered to ensure that tasks which are deemed to be of higher priority (either due to a must-run-before relationship, failure, or due to specific scheduling policies) have an available container.

In the future, affinity will become part of the scheduling decision. This could be dictated by common resources shared between tasks, which need only be loaded by the first task running in a container, or by the data generated by the first task, which can then directly be processed by subsequent tasks, without needing to move/serialize the data – especially in the case of One-to-One edges.

Beyond simple JVM Re-Use

Cluster Dependent Work Allocation

At the moment, the number of tasks for a vertex, and their corresponding ‘work-units’, are determined up front. Going forward, this is likely to change to a model where a certain number of tasks are set up front based on cluster resources, but the work-units for these tasks are determined at runtime. This allows additional optimizations where tasks that complete early are given additional work, and also allows for better locality-based assignment of work.

Object Registry

Each Tez JVM (or container) contains an object cache, which can be used to share data between different tasks running within the same container. This is a simple Key-Object store, with different levels of visibility/retention. Objects can be cached for use within tasks belonging to the same Vertex, for all tasks within a DAG, and for tasks running across a Tez Session (more on Sessions in a subsequent post). The resources being cached may, in the future, be made available as a hint to the Tez Scheduler for affinity based scheduling.

Examples of usage:

1) Hive makes use of this object registry to cache data for Broadcast Joins, which is fetched and computed once by the first task and then used directly by the remaining tasks that run in the same JVM.

2) The sort buffer used by OnFileSortedOutput can be cached, and re-used across tasks.

Learn more about Apache Tez here. 

[repost ]Apache Tez: A Generalization of MapReduce Data Processing

original:http://www.infoq.com/cn/news/2013/09/TEZ

As discussed in a recent InfoQ article, Hortonworks’ new Stinger Initiative relies heavily on Tez, a brand-new Hadoop data processing framework.

The blog post Apache Tez: A New Chapter in Hadoop Data Processing writes:

“Higher-level data processing applications such as Hive and Pig need an execution framework that can express their complex query logic in an efficient manner and execute those queries with high performance. Apache Tez ... provides an alternative to traditional MapReduce that lets jobs meet demands for fast response times and extreme throughput at petabyte scale.”

To achieve this, rather than modeling data processing as a single job, Tez treats it as a dataflow graph:

... the vertices in the graph represent application logic, while the edges represent data movement. A rich dataflow definition API lets users express complex query logic in an intuitive way, and it is a natural fit for the query plans produced by higher-level declarative applications such as Hive and Pig ... A dataflow pipeline can be expressed as a single Tez job that runs the entire computation, and Tez takes care of expanding this logical graph into a physical graph of tasks and executing it.

At a Tez vertex, the specific user logic is modeled as input, processor, and output modules. The input and output modules define the input and output data, including its format, access method, and location, while the processor module defines the data transformation logic, which can be expressed as a MapReduce map task or a reducer. Although Tez does not explicitly impose any restrictions on data formats, it requires the inputs, outputs, and processors to be compatible with one another; similarly, the input/output pair connected by an edge must be compatible in format and location.

The blog post Data Processing API in Apache Tez describes a simple Java API for expressing a data processing DAG (directed acyclic graph). The API has three components:

  • DAG defines the overall job. The user creates a DAG object for each data processing job.
  • Vertex defines the user logic, along with the resources and environment needed to execute it. The user creates a Vertex object for each step in the job and adds it to the DAG.
  • Edge defines the connection between producer and consumer vertices. The user creates an Edge object to connect the producer and consumer vertices.

The edge properties defined by Tez enable it to instantiate user tasks, configure their inputs and outputs, schedule them appropriately, and determine how data is routed between tasks. Tez also supports defining the parallelism of each vertex’s execution based on user guidance, data size, and resources.

  • Data movement: defines the routing of data between tasks.
    • One-to-one: data from the i-th producer task routes to the i-th consumer task.
    • Broadcast: data from a producer task routes to all consumer tasks.
    • Scatter-gather: producer tasks scatter data into shards and consumer tasks gather the shards; the i-th shard from every producer task routes to the i-th consumer task.
  • Scheduling: defines when a consumer task is scheduled.
    • Sequential: a consumer task may be scheduled after a producer task completes.
    • Concurrent: a consumer task must be co-scheduled with a producer task.
  • Data source: defines the lifetime/reliability of a task’s output.
    • Persisted: the output remains available after the task exits; it may be discarded later.
    • Persisted-reliable: the output is reliably stored and will always be available.
    • Ephemeral: the output is available only while the producer task is running.

For more details on the Tez architecture, see the Tez design document.

The idea of representing data processing as dataflow is not new; it is the foundation of Cascading, and many applications built on Oozie achieve the same goal. The advantage of Tez, by comparison, is that it puts all of this into a single framework that is optimized for resource management (based on Apache Hadoop YARN), data transfer, and execution. In addition, the Tez design provides support for pluggable vertex management modules that collect relevant information from tasks and change the dataflow graph at runtime, optimizing for performance and resource usage.

Read the original English article: Apache Tez – a Generalization of the MapReduce Data Processing

[repost ]APACHE ACCUMULO EXAMPLES

original:http://accumulo.apache.org/1.5/examples/

Before running any of the examples, the following steps must be performed.

  1. Install and run Accumulo via the instructions found in $ACCUMULO_HOME/README. Remember the instance name. It will be referred to as “instance” throughout the examples. A comma-separated list of zookeeper servers will be referred to as “zookeepers”.
  2. Create an Accumulo user (see the user manual), or use the root user. The Accumulo user name “username” with password “password” is used throughout the examples. This user needs the ability to create tables.

In all commands, you will need to replace “instance”, “zookeepers”, “username”, and “password” with the values you set for your Accumulo instance.

Commands intended to be run in bash are prefixed by ‘$’. These are always assumed to be run from the $ACCUMULO_HOME directory.

Commands intended to be run in the Accumulo shell are prefixed by ‘>’.

Each README in the examples directory highlights the use of particular features of Apache Accumulo.

batch: Using the batch writer and batch scanner.

bloom: Creating a bloom filter enabled table to increase query performance.

bulkIngest: Ingesting bulk data using map/reduce jobs on Hadoop.

classpath

client

combiner: Using example StatsCombiner to find min, max, sum, and count.

constraints: Using constraints with tables.

dirlist: Storing filesystem information.

export

filedata: Storing file data.

filter: Using the AgeOffFilter to remove records more than 30 seconds old.

helloworld: Inserting records both inside and outside map/reduce jobs, and reading records between two rows.

isolation: Using the isolated scanner to ensure partial changes are not seen.

mapred: Using MapReduce to read from and write to Accumulo tables.

maxmutation: Limiting mutation size to avoid running out of memory.

regex

rowhash

shard: Using the intersecting iterator with a term index partitioned by document.

tabletofile

terasort

visibility: Using visibilities (or combinations of authorizations). Also shows user permissions.