POC #: Analyse social bookmarking sites to find insights
Industry: Social Media
Data: The dataset is gathered from social bookmarking sites, which let users bookmark, review, rate, and search links on any topic. The data is in XML format and contains the categories describing each link and the ratings associated with it.
Problem Statement: Analyse the data in the Hadoop ecosystem to:
1. Fetch the data into the Hadoop Distributed File System (HDFS) and analyse it with MapReduce, Pig and Hive to find the top-rated links based on user comments, likes etc.
2. Using MapReduce, convert the semi-structured XML data into a structured, comma-separated format.
3. Push the MapReduce output to HDFS and feed it into Pig, which splits the data into two parts: Category data and Ratings data (see the Pig sketch after this list).
4. Write a Hive query to analyse the data further and push the output into a relational database (RDBMS) using Sqoop.
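Step 3 is described but not coded anywhere below, so here is a minimal Pig Latin sketch of the split. It assumes the column order produced by MyMapper further down; the relation names and HDFS paths (books, category_data, ratings_data, /poc/...) are hypothetical, not part of the original POC.

-- Load the CSV produced by the MapReduce job (column order assumed to
-- match MyMapper's output).
books = LOAD '/poc/output/part-m-00000' USING PigStorage(',')
        AS (id:chararray, author:chararray, title:chararray,
            genre:chararray, price:float, publish_date:chararray,
            descriptions:chararray, review:chararray,
            rate:float, comments:chararray);

-- Split into the two datasets described in step 3.
category_data = FOREACH books GENERATE id, title, genre;
ratings_data  = FOREACH books GENERATE id, title, review, rate, comments;

STORE category_data INTO '/poc/pig/category' USING PigStorage(',');
STORE ratings_data  INTO '/poc/pig/ratings'  USING PigStorage(',');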
POC Coding Details
Input Data (XML Files)
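The original XML files are not reproduced here. Based on the tags that MyMapper extracts below, a representative record (all values illustrative) would look like the following. Note that with the default TextInputFormat each map() call receives one line, so each record must arrive as a complete XML fragment on a single line; it is shown pretty-printed here only for readability.

<book>
  <id>bk101</id>
  <author>John Doe</author>
  <title>Sample Title</title>
  <genre>Computer</genre>
  <price>44.95</price>
  <publish_date>2000-10-01</publish_date>
  <descriptions>An in-depth look at XML.</descriptions>
  <review>Good read</review>
  <rate>4.5</rate>
  <comments>120</comments>
</book>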
MapReduce Code to convert XML File to Flat File or Comma Separated File.
MyMapper.java
import
java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import
javax.xml.parsers.DocumentBuilder;
import
javax.xml.parsers.DocumentBuilderFactory;
import
org.apache.commons.logging.Log;
import
org.apache.commons.logging.LogFactory;
import
org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import
org.apache.hadoop.io.Text;
import
org.apache.hadoop.mapreduce.Mapper;
import
org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import
org.w3c.dom.NodeList;
public class MyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private static final Log LOG = LogFactory.getLog(MyMapper.class);

    // Each input line is expected to hold one complete <book> record.
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            // Parse the incoming XML fragment into a DOM document.
            InputStream is = new ByteArrayInputStream(value.toString().getBytes());
            DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
            Document doc = dBuilder.parse(is);
            doc.getDocumentElement().normalize();
            NodeList nList = doc.getElementsByTagName("book");
            for (int temp = 0; temp < nList.getLength(); temp++) {
                Node nNode = nList.item(temp);
                if (nNode.getNodeType() == Node.ELEMENT_NODE) {
                    Element eElement = (Element) nNode;
                    // Pull each field out of the <book> element.
                    String id = eElement.getElementsByTagName("id").item(0).getTextContent();
                    String author = eElement.getElementsByTagName("author").item(0).getTextContent();
                    String title = eElement.getElementsByTagName("title").item(0).getTextContent();
                    String genre = eElement.getElementsByTagName("genre").item(0).getTextContent();
                    String price = eElement.getElementsByTagName("price").item(0).getTextContent();
                    String publish_date = eElement.getElementsByTagName("publish_date").item(0).getTextContent();
                    String descriptions = eElement.getElementsByTagName("descriptions").item(0).getTextContent();
                    String review = eElement.getElementsByTagName("review").item(0).getTextContent();
                    String rate = eElement.getElementsByTagName("rate").item(0).getTextContent();
                    String comments = eElement.getElementsByTagName("comments").item(0).getTextContent();
                    // Emit one comma-separated line per record; no reducer is needed.
                    context.write(new Text(id + "," + author + "," + title + "," + genre + ","
                            + price + "," + publish_date + "," + descriptions + ","
                            + review + "," + rate + "," + comments), NullWritable.get());
                }
            }
        } catch (Exception e) {
            throw new IOException(e);
        }
    }
}
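For the illustrative record shown under Input Data, this mapper would emit one flat comma-separated line, e.g. (values hypothetical):

bk101,John Doe,Sample Title,Computer,44.95,2000-10-01,An in-depth look at XML.,Good read,4.5,120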
XMLDriver.java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class XMLDriver {
    /**
     * Bhavesh - for processing XML file using Hadoop MapReduce
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        try {
            Configuration conf = new Configuration();
            String[] files = new GenericOptionsParser(conf, args).getRemainingArgs();
            // The original listing breaks off here; standard map-only job setup assumed below.
            Job job = Job.getInstance(conf, "XML to CSV");
            job.setJarByClass(XMLDriver.class);
            job.setMapperClass(MyMapper.class);
            job.setNumReduceTasks(0); // no reducer: the mapper already emits CSV lines
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            FileInputFormat.addInputPath(job, new Path(files[0]));
            FileOutputFormat.setOutputPath(job, new Path(files[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        } catch (Exception e) {
            throw new IOException(e);
        }
    }
}
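With the classes packaged into a jar, the conversion step would be launched along the lines of hadoop jar poc.jar XMLDriver /poc/input /poc/output (jar name and paths assumed for illustration). The part files written to the output directory then feed the Pig split sketched under the problem statement above.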