How to set up Apache Pig with Apache Hadoop
Download Hadoop 1.0.0 and set it up as a multi-node cluster.
Download Apache Pig 0.9.1 and extract it.
Export HADOOP_HOME, pointing it at the directory where you installed Apache Hadoop.
Start Apache Pig in mapreduce mode:
bin/pig
You will get the grunt prompt:
grunt>
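If bin/pig does not start in mapreduce mode, a common cause is a missing or wrong HADOOP_HOME. The following is a minimal preflight sketch, assuming the Hadoop 1.0.0 layout where the launcher script lives at bin/hadoop under the install directory. The class name HadoopHomeCheck and the method hadoopLauncher are hypothetical names used only for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Pig or Hadoop.
final class HadoopHomeCheck {

    // Returns the path to bin/hadoop under HADOOP_HOME, or empty when the
    // variable is unset or the launcher script is not a regular file.
    static Optional<Path> hadoopLauncher(Map<String, String> env) {
        String home = env.get("HADOOP_HOME");
        if (home == null || home.isEmpty()) {
            return Optional.empty();
        }
        Path launcher = Paths.get(home, "bin", "hadoop");
        return Files.isRegularFile(launcher) ? Optional.of(launcher) : Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> launcher = hadoopLauncher(System.getenv());
        if (launcher.isPresent()) {
            System.out.println("Found Hadoop launcher: " + launcher.get());
        } else {
            System.out.println("HADOOP_HOME is unset or does not contain bin/hadoop");
        }
    }
}
```

Run it from the same shell in which you exported HADOOP_HOME, so that it sees the same environment as bin/pig.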
Distributed grep using Hadoop
The Hadoop word count example is commonly used to introduce MapReduce concepts. I have altered the word count sample to do pattern matching, so that it works like the UNIX grep command.
First copy the text file to an HDFS location, then run the job:
bin/hadoop dfs -copyFromLocal <local-dir> <hdfs-dir>
bin/hadoop jar <path>/grep.jar org.myorg.Grep <hdfs-input-dir> <hdfs-output-dir> <pattern>
package org.myorg;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Grep {

    // Mapper that emits every input line containing a match for the regex.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Pattern pattern;
        private int group; // read from the job configuration, not used by map() below

        // Called once per task: compile the regex passed in through the job configuration.
        public void configure(JobConf job) {
            pattern = Pattern.compile(job.get("mapred.mapper.regex"));
            group = job.getInt("mapred.mapper.regex.group", 0);
        }

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
            String line = value.toString();
            Matcher matcher = pattern.matcher(line);
            // find() matches anywhere in the line, like grep without anchors.
            if (matcher.find()) {
                output.collect(new Text(line), one);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
            return;
        }
        JobConf conf = new JobConf(Grep.class);
        conf.setJobName("Grep");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        // Hand the pattern (and the optional group) to the mappers via the configuration.
        conf.set("mapred.mapper.regex", args[2]);
        if (args.length == 4) {
            conf.set("mapred.mapper.regex.group", args[3]);
        }
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
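To try a pattern on a small sample before submitting the job, the matching rule of the mapper can be reproduced without Hadoop. The sketch below is only an illustration: GrepSketch, matchingLines and matchedGroups are hypothetical names, and matchedGroups shows one possible use of the optional <group> argument, which is an assumption since the job above reads that value but does not use it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Local, Hadoop-free sketch of the matching rule used in Grep.Map.
final class GrepSketch {

    // Mirrors the mapper: keep a line when the pattern is found anywhere in it.
    static List<String> matchingLines(List<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (pattern.matcher(line).find()) {
                kept.add(line);
            }
        }
        return kept;
    }

    // Assumed use of <group>: collect the text of the given capture group
    // for every match in every line (group 0 is the whole match).
    static List<String> matchedGroups(List<String> lines, String regex, int group) {
        Pattern pattern = Pattern.compile(regex);
        List<String> found = new ArrayList<>();
        for (String line : lines) {
            Matcher matcher = pattern.matcher(line);
            while (matcher.find()) {
                found.add(matcher.group(group));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "ERROR disk full on node1",
                "INFO job started",
                "ERROR timeout on node2");
        // prints [ERROR disk full on node1, ERROR timeout on node2]
        System.out.println(matchingLines(sample, "^ERROR"));
        // prints [1, 2]
        System.out.println(matchedGroups(sample, "node(\\d+)", 1));
    }
}
```

Because the mapper uses find() rather than matches(), a pattern such as ERROR selects any line that contains it; anchor the pattern with ^ or $ when the position matters.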
Apache Hadoop 1.0.0 released
Apache Hadoop 1.0.0 has been released (see the release note).
Hadoop 1.0.0 was released from the 0.20.2xx.x development tree. There is another new development in the Hadoop space, version 0.23.0, which contains HDFS Federation and NextGen MapReduce (YARN).
How to set up Xen 4.1 with Ubuntu 11.10 and run a VM (domU)
Ubuntu 11.10 was released with Linux 3.0.0, and the Linux kernel now has Xen dom0 support.
Setup steps:
1. Install Xen hypervisor
apt-get install xen-hypervisor-4.1-amd64
2. Create a domU configuration (node0.cfg)
disk = [ 'file:/mnt/vm/xen/vm_images/node0.iso,hda,w', 'file:/media/mnt/vm/xen/iso_files/oneiric-desktop-i386.iso,hdc:cdrom,r' ]
memory=1024
vcpus=1
name="node0"
vif=[ 'type=ioemu,bridge=virbr0' ]
builder = "hvm"
device_model = "/usr/lib/xen-4.1/bin/qemu-dm"
vnc=1
vncunused=1
apic=0
acpi=0
pae=0
serial = "pty" # enable serial console
boot="dc"
on_reboot = 'restart'
on_crash = 'restart'
3. Start the domain
xm create /pathto/node0.cfg
On the first boot the VM boots from the CD-ROM, and the user can install a new OS on the VM.
By changing the boot order to boot="c", the user can then boot from the hard disk.
4. List running domains.
xm list
PS:
Ubuntu 11.10 still needs some path fixes to run user domains / VMs (domU).
You may get an error in your /var/log/qemu-dm-node0.log saying:
Could not read keymap file: '/usr/share/qemu/keymaps/en-us'
To fix it, create a symbolic link named qemu in /usr/share that points to qemu-linaro.
For bridged networks, the default bridge may not include your default NIC.
If your VMs (domUs) cannot communicate with outside networks, add the NIC to the bridge:
brctl addif virbr0 eth0
Application Development with WSO2 Relational Storage Service (WSO2 RSS)
WSO2 Relational Storage Service is a data storage service provided by the WSO2 StratosLive PaaS. WSO2 RSS supports MySQL and Amazon RDS as the back-end data store.
Creating databases with WSO2 RSS is a simple task. The StratosLive Data Server has an easy RSS user interface for adding and managing databases.
Steps to create a database using WSO2 RSS:
1. Add a database.
2. Create a database user and add the user to a database privileged group.
3. Create tables / manage data using the WSO2 RSS DB console.
RSS-based data stores are accessible within the StratosLive PaaS.
Users can use Java application development methods to access RSS data stores.
The WSO2ConRSS application is a webapp deployed on the StratosLive application servers, and it uses an RSS-based data store to retrieve data. The source code for this sample is available in the OT svn.
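As an illustration of accessing an RSS data store from Java, here is a minimal JDBC sketch. It is not the WSO2ConRSS source. It assumes a MySQL-backed RSS database, a MySQL JDBC driver on the classpath, and code running where the RSS instance is reachable (inside StratosLive, per the note above). The table sessions, its columns title and track, and all names in the class are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class; the schema used here is an assumption.
final class RssJdbcSketch {

    // Builds a MySQL-style JDBC URL from the database host, port and name.
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database;
    }

    // Reads the titles of all rows in the (hypothetical) sessions table
    // for one track, using a parameterized query.
    static List<String> titlesForTrack(String url, String user, String password, String track)
            throws SQLException {
        String sql = "SELECT title FROM sessions WHERE track = ?";
        List<String> titles = new ArrayList<>();
        try (Connection connection = DriverManager.getConnection(url, user, password);
             PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setString(1, track);
            try (ResultSet rows = statement.executeQuery()) {
                while (rows.next()) {
                    titles.add(rows.getString("title"));
                }
            }
        }
        return titles;
    }

    public static void main(String[] args) throws SQLException {
        if (args.length < 6) {
            System.out.println("RssJdbcSketch <host> <port> <database> <user> <password> <track>");
            return;
        }
        String url = jdbcUrl(args[0], Integer.parseInt(args[1]), args[2]);
        for (String title : titlesForTrack(url, args[3], args[4], args[5])) {
            System.out.println(title);
        }
    }
}
```

The user and password are those of the database user created in step 2, and the database name is the one added in step 1.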
How to use the StratosLive Column (Family Data) Store Service
WSO2 CSS is a column (family data) store based on Apache Cassandra. WSO2 CSS can be deployed with any WSO2 Carbon-based product, and it is available as a service in StratosLive, the PaaS offering of WSO2.
It is very easy to use CSS as a data store with widely available connectors such as the Java-based Hector and other Thrift-based connectors. StratosLive supports the Hector API for communicating with the Cassandra-based back-end CSS cluster. External applications can use the StratosLive PaaS column data store feature with any Cassandra connector.
StratosLive app developers have to use tenant information to authenticate the connection to the CSS data store. The tenant admin can create a tenant and authorize the user for data store access.
Check the full sample in the OT SVN.
This sample creates a connection to StratosLive CSS as an external application. It writes random data to a StratosLive CSS keyspace, then reads it back and prints the data to stdout.
Instructions to build and run the sample:
Take a copy of the source using svn.
Build the project with Maven:
mvn clean install
Build the project with dependency libraries:
mvn clean assembly:assembly -o
Execute the program:
java -jar target/org.wso2.carbon.cassandra.examples-3.2.1-jar-with-dependencies.jar
Column (Family Data) Store Service in WSO2 StratosLive
The StratosLive PaaS supports several internal data stores, such as the column (family data) store service and the relational data store service, as well as external data sources such as Amazon DS and Amazon S3. Users can also use external data sources via Web Services.
WSO2 introduced CSS in the StratosLive PaaS to support web-scale data generated by users' deployed applications and by the PaaS itself.
WSO2 Stratos CSS is based on Apache Cassandra. Cassandra has been modified to run on the WSO2 Carbon platform, which is an OSGi environment. Stratos CSS 1.0.0 is shipped with Stratos 1.5.1. Users can install it with WSO2 private cloud deployments. CSS-related features can be deployed with any standalone Carbon product with full functionality.
StratosLive has a separate CSS cluster deployed to store tenant keyspaces. The StratosLive Data Service Server (DSS) contains the user interfaces for managing keyspaces.
CSS is multi-tenanted, and it also works with users in private Stratos deployments.
WSO2 CSS 1.0.0 features:
1. Manage (create / delete / modify) keyspaces
2. Share keyspaces among users
3. Create indexes
4. Monitor keyspaces
WSO2 CSS has an easy user interface for managing keyspaces, and users can also use CSS to manage external keyspaces, so WSO2 CSS can serve as a Cassandra management user interface.
- List Keyspaces
- List Keyspace information
- Create a Keyspace for a tenant
- Create Column Family
- Create Column and Set Indexes
- Share Keyspace
WSO2 Stratos PaaS column (family) data support will improve with CSS-based data services and CQL support in the next CSS releases.
WSO2Con 2011
WSO2Con 2011 is happening in Colombo, Sri Lanka, from Sep 12 to 16 at Waters Edge. WSO2Con 2011 is the second WSO2 developer conference, and it focuses more on the new WSO2 PaaS offering named Stratos and the hosted service StratosLive.
The WSO2Con 2011 main conference starts on 13th Sep and ends on 15th Sep. There are pre-conference and post-conference tutorials for getting real experience with the WSO2 product stack. Each conference day has two tracks, so participants can select talks based on their interests. Check the conference agenda early and select your track.
At this year's conference, WSO2 customers and partners are presenting solutions developed with WSO2 products.











