Design, Scale & Performance of the MapR Distribution

M.C. Srivas
CTO, MapR Technologies, Inc.

6/29/2011
Outline of Talk
• What does MapR do?
• Motivation: why build this?
• Distributed NameNode Architecture
  • Scalability factors
  • Programming model
  • Distributed transactions in MapR
• Performance across a variety of loads


Complete Distribution
 • Integrated, tested, hardened
 • Super simple
 • Unique advanced features
 • 100% compatible with MapReduce, HBase, HDFS APIs
 • No recompile required, drop in and use now
MapR Areas of Development

 [Diagram: MapR development areas – Storage Services, MapReduce, HBase, Ecosystem, and Management]
JIRAs Open For Year(s)
• HDFS-347 – 7/Dec/08 - Streaming perf sub-optimal
• HDFS-273, 395 – 7/Mar/07 – DFS Scalability problems, optimize
  block-reports
• HDFS-222, 950 – Concatenate files into larger files
   • Tom White on 2/Jan/09: "Small files are a big problem for Hadoop ... 10
     million files, each using a block, would use about 3 gigabytes of memory.
     Scaling up much beyond this level is a problem with current hardware.
     Certainly a billion files is not feasible."
• HDFS Append – no 'blessed' Apache Hadoop distro has the fix
• HDFS-233 – 25/Jun/08 – Snapshot support
   • Dhruba Borthakur on 10/Feb/09 "...snapshots can be designed very elegantly
     only if there is complete separation between namespace management and
     block management."



Observations on Apache Hadoop
 • Inefficient – HDFS-347
 • Scaling problems – HDFS-273
   • NameNode bottleneck – HDFS-395
   • Limited number of files – HDFS-222
 • Admin overhead significant
 • NameNode failure loses data
   • Not trusted as permanent store
 • Write-once
   • Data lost unless file closed
   • hflush/hsync – unrealistic to expect folks will re-write apps

 [Chart: sequential READ/WRITE throughput in MB/sec (0–1200 scale), raw hardware vs. HDFS]
MapR Approach
• Some are architectural issues
• Change at that level is a big deal
  – Will not be accepted unless proven
  – Hard to prove without building it first


• Build it and prove it
  – Improve reliability significantly
  – Make it tremendously faster at the same time
  – Enable a new class of apps (e.g., real-time analytics)
HDFS Architecture Review
 • Files are broken into blocks
   • Distributed across data-nodes
 • NameNode holds (in memory)
   • Directories, Files
   • Block replica locations
 • Data Nodes
   • Serve blocks
   • No idea about files/dirs
   • All ops go to NN

 [Diagram: files sharded into blocks; DataNodes save blocks]
HDFS Architecture Review
 • DataNode (DN) reports blocks to NameNode (NN)
   • Large DN does 60K blocks/report
     • 256M x 60K = 15T = 5 disks @ 3T per DataNode (worked out in the sketch below)
   • >100K blocks causes extreme load
   • 40GB NN restart takes 1-2 hours
 • Addressing unit is an individual block
   • Flat block-address forces DNs to send giant block-reports
   • NN can hold about ~300M blocks max
     • Limits cluster size to 10's of petabytes
     • Increasing block size negatively impacts map/reduce
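A quick back-of-the-envelope check of the numbers above, written as a small C++ sketch. The only inputs are the figures on the slide (256 MB blocks, 60K blocks per DataNode report, ~300M blocks as the NameNode ceiling); the result matches the "tens of petabytes" claim once 3x replication is considered.

    #include <cstdio>

    int main() {
        // One large DataNode: 60,000 blocks of 256 MB each.
        const double block_mb   = 256.0;
        const double blocks     = 60000.0;
        const double dn_tb      = blocks * block_mb / (1024.0 * 1024.0);              // ~15 TB, i.e. ~5 disks @ 3 TB
        // NameNode ceiling: ~300 million blocks held in memory.
        const double max_blocks = 300e6;
        const double cluster_pb = max_blocks * block_mb / (1024.0 * 1024.0 * 1024.0); // ~70 PB of raw block data
        std::printf("per-DataNode data: %.1f TB\n", dn_tb);
        std::printf("max addressable raw data: ~%.0f PB (tens of PB usable after 3x replication)\n", cluster_pb);
        return 0;
    }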
How to Scale
• Central meta server does not scale
  – Make every server a meta-data server too
  – But need memory for map/reduce
    •    Must page meta-data to disk
• Reduce size of block-reports
  – while increasing number of blocks per DN
• Reduce memory footprint of location service
  – cannot add memory indefinitely
• Need fast-restart (HA)
MapR Goal: Scale to 1000X

                    HDFS            MapR
  # files           150 million     1 trillion
  # data            10-50 PB        1-10 Exabytes
  # nodes           2000            10,000+

 • Full random read/write semantics
   • export via NFS and other protocols
   • with enterprise-class reliability: instant-restart, snapshots, mirrors, no-single-point-of-failure, …
 • Run close to hardware speeds
   • On extreme scale, efficiency matters extremely
   • exploit emerging technology like SSD, 10GE
MapR's Distributed NameNode
Files/directories are sharded into blocks, which are placed into mini NNs (containers) on disks.
 • Containers are 16-32 GB segments of disk, placed on nodes
 • Each container contains
   • Directories & files
   • Data blocks
 • Replicated on servers
 • No need to manage directly
   • Use MapR Volumes
Patent Pending
(an illustrative container descriptor follows)
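Purely as an illustration of the container concept on this slide, a container could be described as follows. The names, field types, and layout here are hypothetical assumptions, not MapR's actual data structures.

    #include <cstdint>
    #include <vector>

    // Hypothetical descriptor of one 16-32 GB container (a "mini NameNode").
    struct ContainerInfo {
        uint64_t              containerId;   // globally unique id
        uint64_t              volumeId;      // MapR volume this container belongs to
        std::vector<uint32_t> replicaNodes;  // servers holding a replica of this container
        uint64_t              usedBytes;     // bounded to roughly 16-32 GB of directories, files and data blocks
    };

    int main() {
        ContainerInfo c{1001, 7, {1, 2, 4}, 20ull << 30};   // example: ~20 GB container replicated on 3 servers
        return c.replicaNodes.empty();
    }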
MapR Volumes
Significant advantages over “Cluster-wide” or “File-level” approaches.
Volumes allow management attributes to be applied in a scalable way, at a very granular level and with flexibility:
 • Replication factor
 • Scheduled mirroring
 • Scheduled snapshots
 • Data placement control
 • Usage tracking
 • Administrative permissions
Example volume layout: /projects/tahoe, /projects/yosemite, /user/msmith, /user/bjohnson
100K volumes are OK, create as many as desired!
(a hypothetical sketch of these attributes follows)
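To make the list above concrete, here is a hypothetical sketch of the per-volume attributes, using the example paths from the slide. This is not MapR's API; every name and type is an illustrative assumption.

    #include <cstdint>
    #include <string>

    // Hypothetical per-volume attributes mirroring the list on this slide.
    struct VolumeAttributes {
        std::string mountPath;           // e.g. "/user/msmith" (example path from the slide)
        uint8_t     replicationFactor;   // replication factor
        std::string mirrorSchedule;      // scheduled mirroring
        std::string snapshotSchedule;    // scheduled snapshots
        std::string placementTopology;   // data placement control
        uint64_t    bytesUsed;           // usage tracking
        std::string adminPermissions;    // administrative permissions
    };

    int main() {
        VolumeAttributes v{"/user/msmith", 3, "hourly", "daily", "default", 0, "admins"};
        return (int)v.replicationFactor - 3;
    }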
MapR Distributed NameNode
Containers are tracked globally
 • Clients cache containers & server info for extended periods
   (a sketch of the client-side lookup follows)

 [Diagram: the NameNode map lists the replica servers per container (e.g. S1,S2,S4 / S1,S3 / S1,S4,S5 / …);
  a client fetches container locations from the map, then contacts a server directly to read data
  from the container]
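A rough sketch of the client-side read path implied by this slide. Function and class names are hypothetical; the only MapR name borrowed from the deck is the CLDB, which appears as mapr-cldb on a later slide.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct ContainerLocations { std::vector<uint32_t> servers; };   // replica servers, e.g. {S1, S2, S4}

    // Hypothetical client cache: container id -> replica locations, kept for extended periods.
    class ContainerCache {
    public:
        const ContainerLocations& lookup(uint64_t containerId) {
            auto it = cache_.find(containerId);
            if (it == cache_.end()) {
                // Cache miss: ask the container-location service once, then reuse the answer.
                it = cache_.emplace(containerId, fetchFromLocationService(containerId)).first;
            }
            return it->second;   // caller then contacts one of these servers directly for the data
        }
    private:
        ContainerLocations fetchFromLocationService(uint64_t /*containerId*/) {
            // Stub: a real client would issue an RPC to the CLDB here.
            return {};
        }
        std::unordered_map<uint64_t, ContainerLocations> cache_;
    };

    int main() {
        ContainerCache cache;
        const ContainerLocations& loc = cache.lookup(1001);   // first call misses, later calls hit the cache
        return (int)loc.servers.size();
    }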
MapR's Distr NameNode Scaling
 • Containers represent 16-32 GB of data
   • Each can hold up to 1 billion files and directories
   • 100M containers = ~2 Exabytes (a very large cluster)
 • 250 bytes of DRAM to cache a container
   • 25GB to cache all containers for a 2EB cluster
     • But not necessary, can page to disk
   • Typical large 10PB cluster needs 2GB
 • Container-reports are 100x-1000x smaller than HDFS block-reports
   • Serve 100x more data-nodes
   • Increase container size to 64G to serve a 4EB cluster
     • Map/reduce not affected
(the 2EB and 25GB figures are checked in the sketch below)
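The same kind of back-of-the-envelope check for the container numbers. The only inputs are the figures on the slide; 20 GB is assumed as the midpoint of the 16-32 GB container range.

    #include <cstdio>

    int main() {
        // Figures from the slide: 100 million containers, ~20 GB each, 250 bytes of DRAM per cached container.
        const double containers       = 100e6;
        const double gb_per_container = 20.0;     // midpoint of the 16-32 GB range
        const double bytes_per_entry  = 250.0;

        const double total_eb = containers * gb_per_container / 1e9;   // GB -> EB (decimal units)
        const double cache_gb = containers * bytes_per_entry / 1e9;    // bytes -> GB

        std::printf("data addressed: ~%.0f EB\n", total_eb);             // ~2 EB
        std::printf("location cache: ~%.0f GB of DRAM\n", cache_gb);     // ~25 GB, and it can be paged to disk
        return 0;
    }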
MapR Distr NameNode HA

MapR:
1. apt-get install mapr-cldb while cluster is online

Apache Hadoop*:
1. Stop cluster very carefully
2. Move fs.checkpoint.dir onto NAS (e.g. NetApp)
3. Install, configure DRBD + Heartbeat packages
   i. yum -y install drbd82 kmod-drbd82 heartbeat
   ii. chkconfig -add heartbeat (both machines)
   iii. edit /etc/drbd.conf on 2 machines
   iv-xxxix. make raid-0 md, ask drbd to manage the raid md, zero it if drbd dies & try again
   xxxx. mkfs ext3 on it, mount /hadoop (both machines)
   xxxxi. install all rpms in /hadoop, but don't run them yet (chkconfig off)
   xxxxii. umount /hadoop (!!)
   xxxxiii. edit 3 files under /etc/ha.d/* to configure heartbeat
...
40. Restart cluster. If any problems, start at /var/log/ha.log for hints on what went wrong.

*As described in www.cloudera.com/blog/2009/07/hadoop-ha-configuration (author: Christophe Bisciglia, Cloudera)
Step Back & Rethink Problem
Big disruption in hardware landscape:

                     Year 2000    Year 2012
  # cores per box    2            128
  DRAM per box       4GB          512GB
  # disks per box    250+         12
  Disk capacity      18GB         6TB
  Cluster size       2-10         10,000

 • No spin-locks / mutexes, 10,000+ threads
 • Minimal footprint – preserve resources for App
 • Rapid re-replication, scale to several Exabytes
MapR's Programming Model
Written in C++ and is asynchronous:
    ioMgr->read(…, callbackFunc, void *arg)
Each module runs requests from its request-queue
 • One OS thread per cpu-core
 • Dispatch: map container -> queue -> cpu-core
 • Callback guaranteed to be invoked on same core
   • No mutexes needed
 • When load increases, add cpu-core + move some queues to it
State machines on each queue
 • 'thread stack' is 4K, 10,000+ threads cost ~40M
 • Context-switch is 3 instructions, 250K c.s./core/sec ok!
(a minimal sketch of the dispatch pattern follows)
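A minimal sketch, not MapR code, of the dispatch pattern described above: requests are hashed by container to a per-core queue, each queue is drained by exactly one thread, and the completion callback runs on that same core, so per-queue state needs no mutex. All names are illustrative assumptions, and a real implementation would use lock-free queues rather than std::deque.

    #include <cstdint>
    #include <cstdio>
    #include <deque>
    #include <functional>
    #include <vector>

    struct Request {
        uint64_t containerId;
        std::function<void()> callback;   // invoked on the same core that ran the request
    };

    class Dispatcher {
    public:
        explicit Dispatcher(unsigned cores) : queues_(cores) {}

        // Dispatch: map container -> queue -> cpu-core.
        void submit(Request r) {
            queues_[r.containerId % queues_.size()].push_back(std::move(r));
        }

        // Each core drains only its own queue, so the state behind each queue
        // is touched by exactly one thread and needs no mutex.
        void runCore(unsigned core) {
            auto& q = queues_[core];
            while (!q.empty()) {
                Request r = std::move(q.front());
                q.pop_front();
                // ... perform the I/O for r.containerId ...
                r.callback();          // completion runs on this same core
            }
        }

    private:
        std::vector<std::deque<Request>> queues_;   // one request-queue per cpu-core
    };

    int main() {
        Dispatcher d(4);                                         // e.g. 4 cores -> 4 queues
        d.submit({42, [] { std::puts("done on this core"); }});
        d.runCore(42 % 4);                                       // the core that owns container 42's queue
        return 0;
    }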
MapR on Linux
 • User-space process, avoids system crashes
 • Minimal footprint
   • Preserves cpu, memory & resources for app
     • uses only 1/5th of system memory
     • runs on 1 or 2 cores, others left for app
   • Emphasis on efficiency, avoids lots of layering
     • raw devices, direct-IO, doesn't use Linux VM (see the generic sketch below)
 • CPU/memory firewalls implemented
   • runaway tasks no longer impact system processes
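For context, this is what the "raw devices, direct-IO" technique looks like in generic Linux terms. It is standard POSIX/Linux usage, not MapR source, and the device path is a placeholder.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        // Open a raw block device with O_DIRECT: the kernel page cache (Linux VM) is bypassed,
        // so the application manages its own caching and I/O scheduling.
        int fd = open("/dev/sdb", O_RDWR | O_DIRECT);   // placeholder device path
        if (fd < 0) { perror("open"); return 1; }

        // O_DIRECT requires buffers aligned to the device's logical block size (512 B or 4 KB).
        void* buf = nullptr;
        if (posix_memalign(&buf, 4096, 1 << 20) != 0) return 1;   // 1 MB aligned buffer

        ssize_t n = pread(fd, buf, 1 << 20, 0);   // read 1 MB at offset 0, uncached
        std::printf("read %zd bytes\n", n);

        free(buf);
        close(fd);
        return 0;
    }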
Random Writing in MapR

 [Diagram: a client writing data asks S1 for a 64M block; the NameNode map creates a container and
  picks a master and 2 replica slaves; the client then attaches and writes the next chunk directly to S2]
MapR's Distributed NameNode
 • Distributed transactions to stitch containers together
 • Each node uses a write-ahead log
   • Supports both value-logging and operational-logging
     • Value log, record = { disk-offset, old, new }
     • Op log, record = { op-details, undo-op, redo-op }
   • Recovery in 2 seconds
   • 'global ids' enable participation in distributed transactions
(the two record shapes are sketched below)
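A sketch of the two log-record shapes named above. The record contents come from the slide; the types, names, and example strings are illustrative assumptions, not MapR's on-disk format.

    #include <cstdint>
    #include <string>
    #include <vector>

    // Value-logging record: physical before/after images of bytes at a disk offset.
    struct ValueLogRecord {
        uint64_t             diskOffset;   // where on disk the change applies
        std::vector<uint8_t> oldBytes;     // "old" - before-image, replayed to undo
        std::vector<uint8_t> newBytes;     // "new" - after-image, replayed to redo
    };

    // Operational-logging record: logical description of the operation.
    struct OpLogRecord {
        std::string opDetails;   // e.g. "create /a/b" (illustrative)
        std::string undoOp;      // inverse operation, replayed to roll back
        std::string redoOp;      // operation replayed during recovery
    };

    int main() {
        ValueLogRecord v{4096, {0x00}, {0x2a}};
        OpLogRecord    o{"create /a/b", "delete /a/b", "create /a/b"};
        (void)v; (void)o;
        return 0;
    }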
2-Phase Commit Unsuitable
• BeginTrans .. work .. Commit   (C = coordinator, P = participant)
 • On app-commit
   • C sends prepare to P
   • P sends prepare-ack, gives up right to abort
     • Waits for C even across crashes/reboots
   • P unlocks only when commit received
 • Too many message exchanges
 • Single failure can lock up entire cluster

 [Diagram: the App commits at coordinator C, which force-writes its log and exchanges prepare/commit
  messages with many participants P]
Quorum-completion Unsuitable
• BeginTrans .. work .. Commit   (C = coordinator, P = participant)
 • On app-commit
   • C broadcasts prepare
   • If majority responds, C commits
   • If not, cluster goes into election mode
   • If no majority found, all fails
 • Update throughput very poor
 • Does not work with < N/2 nodes
 • Monolithic. Hierarchical? Cycles? Oh No!!
MapR Lockless Transactions
• BeginTrans + work + Commit
 • No explicit commit
 • Uses rollback
   • confirm callback, piggy-backed
   • Undo on confirmed failure
   • Any replica can confirm
 • Update throughput very high
 • No locks held across messages
 • Crash resistant, cycles OK
Patent pending

 [Diagram: replicated nodes NN1–NN4 participating in a cycle of transactions]
Small Files (Apache Hadoop, 10 nodes)

 [Chart: create rate (files/sec) vs. # of files (millions), "out of box" vs. "tuned" configurations]
 Op: create file, write 100 bytes, close
 Notes:
  - NN not replicated
  - NN uses 20G DRAM
  - DN uses 2G DRAM
MapR Distributed NameNode
Same 10 nodes, but with 3x replication added …

 [Chart: create rate (100-byte files/sec) vs. # of files (millions); the test was stopped at the point marked]
MapR's Data Integrity
 • End-to-end check-sums on all data (not optional)
   • Computed in client's memory, written to disk at server
   • On read, validated at both client & server
 • RPC packets have own independent check-sum
   • Detects RPC msg corruption
 • Transactional with ACID semantics
   • Meta data incl. log itself is check-summed
   • Allocation bitmaps written to two places (dual blocks)
 • Automatic compression built-in
(a generic illustration of end-to-end check-summing follows)
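A generic illustration of the end-to-end checksum idea: the client computes the checksum in its own memory before sending, and the same value is re-verified on every read. The CRC-32 here is a standard stand-in, not MapR's actual checksum or on-disk format.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Simple CRC-32 (reflected, polynomial 0xEDB88320) - stands in for whatever checksum the filesystem uses.
    uint32_t crc32(const std::vector<uint8_t>& data) {
        uint32_t crc = 0xFFFFFFFFu;
        for (uint8_t b : data) {
            crc ^= b;
            for (int i = 0; i < 8; ++i)
                crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
        return ~crc;
    }

    int main() {
        std::vector<uint8_t> block = {'h', 'e', 'l', 'l', 'o'};
        uint32_t clientSum = crc32(block);   // computed in the client's memory, shipped with the data
        // ... data + checksum travel over RPC and are written to disk at the server ...
        uint32_t serverSum = crc32(block);   // server re-validates before acking; client re-validates on read
        std::printf("match: %s\n", clientSum == serverSum ? "yes" : "corruption detected");
        return 0;
    }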
MapR’s Random-Write Eases Data Import

With MapR, use NFS:
1. mount /mapr
   real-time, HA
   (a trivial POSIX example follows)

Otherwise, use Flume/Scribe:
1. Set up sinks (find unused machines??)
2. Set up intrusive agents
   i. tail(“xxx”), tailDir(“y”)
   ii. agentBESink
3. All reliability levels lose data
   i. best-effort
   ii. one-shot
   iii. disk fail-over
   iv. end-to-end
4. Data not available now
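Because the cluster is NFS-mountable with full random-write semantics, any ordinary program can import data with plain POSIX calls. A trivial sketch: the /mapr mount point comes from the slide, while the file path under it is a made-up example.

    #include <cstdio>

    int main() {
        // With the cluster mounted at /mapr, importing data is just writing a file.
        // (The path under /mapr is an illustrative example.)
        FILE* f = std::fopen("/mapr/projects/tahoe/import.log", "a");
        if (!f) { std::perror("fopen"); return 1; }
        std::fputs("2011-06-29 12:00:00 some-event\n", f);   // appended data is visible to readers immediately
        std::fclose(f);
        return 0;
    }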
MapR's Streaming Performance

 [Charts: streaming throughput in MB/sec (0–2250 scale), raw hardware vs. MapR vs. Hadoop, for Read and
  Write, on 11 x 7200rpm SATA disks and on 11 x 15Krpm SAS disks. Higher is better.]

 Tests: i. 16 streams x 120GB    ii. 2000 streams x 1GB
HBase on MapR
YCSB Insert with 1 billion 1K records
10+1 node cluster: 8 core, 24GB DRAM, 11 x 1TB 7200 RPM

 [Chart: insert throughput in 1000 records/sec (0–600 scale), MapR vs. Apache, with WAL off and WAL on.
  Higher is better.]
HBase on MapR
YCSB Random Read with 1 billion 1K records
10+1 node cluster: 8 core, 24GB DRAM, 11 x 1TB 7200 RPM

 [Chart: read throughput in records/sec (0–25000 scale), MapR vs. Apache, for Zipfian and Uniform request
  distributions. Higher is better.]
Terasort on MapR
10+1 nodes: 8 core, 24GB DRAM, 11 x 1TB SATA 7200 rpm

 [Chart: elapsed time in minutes for 1.0 TB and 3.5 TB sorts, MapR vs. Hadoop. Lower is better.]
PigMix on MapR

 [Chart: time in seconds (0–4000 scale) per PigMix query, MapR vs. Hadoop. Lower is better.]
Summary
 • Fully HA
   • JobTracker, Snapshot, Mirrors, multi-cluster capable
 • Super simple to manage
   • NFS mountable
 • Complete read/write semantics
   • Can see file contents immediately
 • MapR has signed the Apache CCLA
   • Zookeeper, Mahout, YCSB, HBase fixes contributed
   • Continue to contribute more and more
 • Download it at www.mapr.com
