Hadoop
Proof of Concept or POC on Customer Complaints Analysis
POC #: Customer Complaints Analysis. The POC is based on Consumer Complaints r...
Proof of Concept or POC on Youtube Data Analysis
POC #: Youtube Data Analysis The POC is based on Youtube data. Public DATA...
POC #: Analyse social bookmarking sites to find insights
Industry: Social Media. Data: It comprises the information gathered from...
POC #: Generate Analytics from a Product based Company ...
POC #: Generate Analytics from a Product based Company Web Log. The POC is ...
POC #: Sensex Log Data Processing (PDF File Processing ...
Industry: Financial Data: Input Format - .PDF (Our Input Data is in PDF Forma...
How to Analyze Data in Apache Spark
In this activity, you will load data into Apache Spark and inspect the data u...
Analytics on India census using Spark
POC#: Analytics on India census using Spark In this article, I have explored Ce...
Sentiment Analysis on Demonetization(India)
POC#: Sentiment Analysis on demonetization in India using Spark In this arti...
Pig : Data types and Operators
Data types: simple data types: --------------------- int --> 32 bit int...
Pig : How to perform grouping by Multiple Columns
how to perform grouping by multiple columns. --------------------------------...
Pig : Entire Column Aggregations
Entire column aggregations. select sum(sal) from emp; grunt> describe emp ...
Pig : Word Count Using Pig Data Flow
Word Count Using Pig DataFlow: [cloudera@quickstart ~]$ cat comment hadoop is ...
Spark : Entire Column Aggregations
Entire Column Aggregations: sql: select sum(sal) from emp; scala> val e...
Spark : Handling CSV files .. Removing Headers
scala> val l = List(10,20,30,40,50,56,67) scala> val r2 = r.collect.reverse.ta...
Spark : Conditional Transformations
Conditions Transformations: val trans = emp.map{ x => val w = x.split...
Pig : CoGroup examples Vs Union Examples
-- co grouping grunt> cat piglab/emp 101,aaaa,40000,m,11 102,bbbbbb,50000,f,12 103...
Spark : Union and Distinct
Unions in Spark. val l1 = List(10,20,30,40,50) val l2 = List(100,200,300,400,50...
Spark : CoGroup And Handling Empty Compact Buffers
Co Grouping using Spark: -------------------------- scala> branch1.collect.forea...
Pig : load Operator
Load Operator: -------------- to load data from file to relation. [cloudera@qui...
Pig : Foreach Operator
Foreach Operator: ------------------- grunt> emp = load 'piglab/emp' using PigSt...
Pig : Subsetting using Filter, Limit, Sample
Techniques of subsetting relations: i) filter: used for conditional filtering...
Spark : Joins
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal emp spLab/e [cloudera@quickst...
Spark : Joins 2
Denormalizing datasets using Joins [cloudera@quickstart ~]$ cat > children c101,...
Pig : Joins
[cloudera@quickstart ~]$ hadoop fs -cat spLab/e 101,aaaa,40000,m,11 102,bbbbbb,...
Pig : Order [ Sorting ] , exec, run , pig
order :- to sort data (tuples) in ascending or descending order. emp = l...
Pig : Cross Operator to Cartesian
Cross: ----- used for the Cartesian product. Each element of the left set joins ...
Pig : UDFs
Pig UDFS ---------- UDF ---> user defined functions. adv: i) ...
Spark : Spark streaming and Kafka Integration
Steps: 1) Start the ZooKeeper server 2) Start Kafka brokers [one or more] 3)...
Python Examples 1
name = input("Enter name ") age = input("Enter age") print(name, " is ", age...
Pig : Udfs using Python
we can keep multiple functions under one program(.py) transoform.py -------...
Hive Partitioned tables [case study]
[cloudera@quickstart ~]$ cat saleshistory 01/01/2011,2000 01/01/2011,3000 01/0...
Pig Video Lessons
Pig class Links: PigLab1 Video: https://drive.google.com/file/d/0B6ZYkhJgGD6XTz...
Hive(10AmTo1:00Pm) Lab1 notes : Hive Inner and Externa...
hive> create table samp1(line string); -- here we did not select any database. ...
Python Options in Hadoop
New developers in the Hadoop ecosystem often struggle to get involved because th...
16 Hadoop fs Commands Every Data Engineer Must Know
Commands in Hadoop The Hadoop shell is the CLI for the Hadoop cluster. Most of t...
Ultimate Hadoop Python Example
What are the options for using Python in Hadoop? Python developers are looking t...
How to Find HDFS Path URL?
Have you ever been running a script from the HDFS command line and gotten this er...
What’s New in Hadoop 3.0?
Major Hadoop Release! Hadoop 3.0 has dropped! There is a lot of excitement in...
Freelance Hadoop Administrative Roles
Freelance Hadoop Admin Roles A lot of the world’s economy is shifting to freelan...
What is the Difference Between Spark & Hadoop
Spark & Hadoop Workloads are Huge Data Engineers and Big Data Developers spend a...
Learn HDFS Without Java?
HDFS Skills Without Java In the world of Hadoop and Big Data HDFS is king. Data ...
Certifications Required For Hadoop Administrators?
Hadoop Certifications Data Engineers looking to grow their careers are constantl...
Spark vs. Hadoop 2019
Spark vs. Hadoop 2019 In 2019, which skill is in more demand for Data Engineers S...
Hadoop: 3 Top Real Time Applications
Hadoop's massive data processing capability helps build many real-time applica...
Understanding Hadoop MapReduce Fault Tolerance
Hadoop MapReduce is totally different from other distributed systems. It handles...
Sqoop, Flume and Storm Understand The Differences Quickly
Top differences between Sqoop, Flume and Storm in the Hadoop framework
What is Adaptive MapReduce in Hadoop and How it Works
The performance and the approach of Adaptive MapReduce in Hadoop explained.
How to Copy HDFS files to Local Linux GET Vs copyToLocal
Two popular HDFS commands you can use to copy HDFS files to local Linux. I have ...
How to Install HBase Properly
HBase is a NoSQL database in the Hadoop framework. Correct installation is needed t...
Industry’s First Auto-Scaling Hadoop Clusters
Background In 2009 I first started playing around with Hive and EC2/S3. I was bl...
Optimizing Hadoop for S3 – Part 1
Introduction: Users of Qubole Data Service use Hive queries or Hadoop jobs to pr...
Sqoop as a Service
Background: As Qubole Data Service has gained adoption – many of our customers a...
Case Study: Building Analytics Applications
This is a guest blog post written by Marc Rossen, a Qubole user, and advocate. T...
Caching in on the cloud!
Motivation One of the interesting things about using Hadoop and Hive in the clou...
Case Study: Big Data Cloud Computing – Part 1
The scalability of cloud databases and the potential of big data cloud computing...
Top 10 Industry Examples of HDFS
Top 10 Industry Examples of HDFS Not everyone comes to us with a clear strategy ...
Qubole Available on Google Compute Engine
Qubole is a leading provider of Hadoop as a service with the mission of providin...
Save Time Executing Hive Queries Using Command Templates
A common characteristic of many analytics queries is that they are mostly invari...
Hadoop Cloud vs On-Premise Hadoop
As topics of conversation go, the terms “Big Data” and “Hadoop functionality” se...
Komli Media Improves Utilization with Premium Big Data ...
Komli Media, Asia Pacific’s leading media technology company, depends on reachin...
Accenture Technology Labs Hadoop Deployment Comparison ...
Background The Accenture Technology Labs Hadoop Deployment Comparison study rece...
The Challenges and Opportunities for E-commerce in a Bi...
The highly competitive world of e-Commerce is driven by price and advertising. C...
Job Scheduling in Hadoop – A 7 Year Perspective
In a recent presentation at Flipkart’s 2014 SlashN conference, I summarized seve...
Announcing General Availability of Presto-as-a-Service
Presto Ready! We announced our Presto-as-a-Service Alpha Program on Amazon Web S...
Hadoop in the Cloud: Qubole shows 2x – 8x speedup in pe...
Qubole aims to provide the best platform for big data analysis in the cloud. In ...
Forbes: Qubole Data Service Road to Hadoop
On Monday, May 26, 2014, Qubole was featured on Forbes.com. Technology contribut...
Qubole Founders Open Up About the Transformation of Hadoop
Seven years ago, Joydeep Sen Sarma and Ashish Thusoo were first introduced to bi...
Hadoop vs Traditional Databases: Big Data Considerations
Today’s ultra-connected world is generating massive volumes of data at ever-acce...
Top 5 Big Data Myths Debunked
The era of big data has arrived. Today, companies both large and small are disco...
Securely sharing data across Organizations with Qubole
Customers love that Qubole enables collaboration via a shared workbench across m...
High Performance Hadoop with New Generation AWS Instances
Welcome New Generation Instance Types Amazon Web Services (AWS) offers a range o...
MapReduce vs Apache Spark
Cluster Computing Comparisons: MapReduce vs Apache Spark Since its early beginni...
Not All Hadoop Distributions are Created Equal
The debate is over. Big data analytics has proven benefits. And organizations lo...
Looking Forward: Hadoop Industry Trends
From its primitive beginnings as a modest open-source search engine called “Nutc...
Hadoop with Enhanced Networking on AWS
Introduction At Qubole, many of our customers run their Hadoop clusters on AWS E...
Apache Hadoop 2.6.0 Now Generally Available on Qubole
We’re excited to announce that Apache Hadoop 2.6.0, the latest stable release* o...
Hadoop is Hard! But Big Data Doesn’t Have To Be
When it comes to big data analytics, Hadoop has been heralded as the all-in-one ...
Drag-n-Drop upgrades of Hadoop, Spark and Presto Clusters
Introduction As the Big Data stack has matured, many companies have started usin...
Multi-tenant Job History Server for Ephemeral Hadoop an...
Introduction Qubole Data Service (QDS) allows users to configure logical Hadoop ...
Riding the Spotted Elephant
Introduction: One of the benefits of moving Hadoop workloads to the cloud is red...
The Main Types of Big Data Vendors: A Comparative Look
The big data boom has given rise to a host of vendors, each promoting their own ...
Keeping Big Data Safe: Common Hadoop Security Issues an...
The big data explosion has given rise to a host of Information technology tools ...
Apache Spark vs Hadoop
Which Big Data Framework is the Best Fit? Apache Hadoop wasn’t just the “elepha...
Cassandra vs Hadoop: A Comparative Look
Technology is reshaping our world. The proliferation of mobile devices, the expl...
Hadoop Data Warehouse
This post was originally published in August 2014 and has since been updated. Bi...
Big Data Implementation
Big Data Challenges With all the hype, it’s little wonder that organizations ar...
The Big Data Lifecycle At TubeMogul
This post was written by Chris Chanyi, Senior Data Architect at TubeMogul. It or...
RubiX: Fast Cache Access for Big Data Analytics on Clou...
Qubole introduced first-generation Caching for S3 files in Presto in 2014 and do...
Quark: Control and Optimize SQL Across Hadoop and RDBMS
One of the important functions of a database administrator is to manage storage ...
Qubole announces Heterogeneous Clusters on AWS – Reduce...
Co-authored by Hariharan Iyer, Member of the Technical Staff at Qubole. Introduc...
Auto-scaling in Qubole With AWS Elastic Block Storage
Co-authored by Hariharan Iyer, Member of the Technical Staff at Qubole. Amazon E...
Data Platforms 2017: The Conference I Wish Existed in 2007
This post is authored by Ashish Thusoo, Co-Founder and Chief Executive Officer, ...
Container Packing: A New Algorithm for Resource Schedul...
Container Packing for Resource Scheduling in the Cloud In this post we describe ...