The latest release of Apache Hadoop code includes a new workload management tool that backers of the project say will make it easier for developers to build applications for the big data platform.

Hadoop has proven itself as a powerful way for some of the leading technology companies in the world, like Yahoo and Google, to manage large amounts of data. Hadoop systems have thus far relied on MapReduce to process data, but the latest iteration of the open source code includes Yarn, a platform for running other applications within Hadoop alongside MapReduce. Yarn monitors the resources applications need and then provisions the capacity within the distributed computing system. Hadoop enthusiasts say this is an important feature that lets more applications run within the open big data system and could lead to a wave of new analytics apps for Hadoop.

"Yarn is on the critical path to Hadoop having better resource management and supporting mixed workloads and usages," says Gartner information management analyst Merv Adrian, who tracks Hadoop. "It fixes some major gaps and will enable some exciting developments in the years ahead."

The 2.0 version adds a number of components, including an architecture for high availability and greater scale for individual clusters, which can now grow to 4,000 machines (a Hadoop deployment can consist of multiple clusters). The biggest change, though, is the addition of Yarn, which has been in planning for four years and under development for two, and which some have described as a next-generation MapReduce architecture.

Yarn splits up two major functions that MapReduce currently combines into one: it separates job scheduling/monitoring from resource management. It works by monitoring what resources applications need, then creating containers of CPU and RAM on the cluster's nodes to serve those apps. "Yarn is fundamentally simple, but extremely scalable," says Arun Murthy, co-founder of Hadoop distribution company Hortonworks, who has been in charge of developing Yarn within the Apache open source community.

Blogger Brian Proffitt at ReadWrite notes that Yarn removes the "one-at-a-time" limitation on apps running on Hadoop. For one, Hadoop gains the functionality to run multiple applications at once. Second, developers can now write apps to Yarn specifications and be assured that they'll work in a Hadoop system. MapReduce can also now focus on its core functionality instead of managing resources for bolt-on apps.

Hadoop backers expect that the advent of Yarn could open the floodgates for new applications built to run on Hadoop. Some projects, like Apache Tez, have already been created to do more advanced data processing than MapReduce specializes in. Tez, for example, uses real-time analytics and in-memory processing for higher-speed queries. Many more applications are expected for streaming analytics. Twitter Storm is one, and other ETL (extract, transform and load) apps could be integrated as well. Technically, engineers could already architect the system to allow additional analysis functionality on top of MapReduce, but Yarn now acts as a platform for hosting apps for that specific purpose.
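
The container model this article describes (Yarn tracks what resources each application needs, then hands out containers of CPU and RAM) can be illustrated with a small sketch. This is a toy model under simplifying assumptions, not Yarn's actual API or scheduler: the names `Node`, `Container` and `ToyResourceManager`, the first-fit placement rule, and the example job names and sizes are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One machine in the cluster, with its remaining free capacity."""
    name: str
    free_cores: int
    free_mem_mb: int


@dataclass
class Container:
    """A slice of one node's CPU and RAM granted to an application."""
    app: str
    node: str
    cores: int
    mem_mb: int


@dataclass
class ToyResourceManager:
    """Hypothetical stand-in for a cluster-wide resource manager."""
    nodes: list
    granted: list = field(default_factory=list)

    def request(self, app, cores, mem_mb):
        """Grant a container on the first node with enough free capacity."""
        for node in self.nodes:
            if node.free_cores >= cores and node.free_mem_mb >= mem_mb:
                node.free_cores -= cores
                node.free_mem_mb -= mem_mb
                container = Container(app, node.name, cores, mem_mb)
                self.granted.append(container)
                return container
        return None  # no node can satisfy the request right now

    def release(self, container):
        """Return a finished container's resources to its node."""
        for node in self.nodes:
            if node.name == container.node:
                node.free_cores += container.cores
                node.free_mem_mb += container.mem_mb
        self.granted.remove(container)


# Two different kinds of application share the same (made-up) cluster.
rm = ToyResourceManager([Node("node-1", 4, 8192), Node("node-2", 4, 8192)])
batch = rm.request("mapreduce-job", cores=3, mem_mb=4096)
stream = rm.request("streaming-job", cores=2, mem_mb=2048)
```

In this sketch the batch job and the streaming job hold containers at the same time, which is the "multiple applications at once" behavior the article attributes to Yarn. Resource management is a separate component from the jobs themselves, so a MapReduce job is just one more client asking for containers.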