What is the Hadoop ecosystem and how does Apache Spark fit in?
I'm having a lot of trouble grasping the 'Hadoop ecosystem' conceptually. I understand that if you have data processing tasks you want to run, you can use MapReduce to split the job into smaller pieces, but I'm unsure what people mean when they say 'Hadoop ecosystem'. I'm also unclear on the benefits of Apache Spark, and why it is seen as so revolutionary. If it's all in-memory calculation, wouldn't that mean you need higher-RAM machines to run Spark jobs? How is Spark different from writing parallelized Python code or something of that nature?
Your question is rather broad - the Hadoop ecosystem is a wide range of technologies that either support Hadoop MapReduce, make it easier to apply, or otherwise interact with it to get stuff done.
Examples:
- The Hadoop Distributed File System (HDFS) stores the data to be processed by MapReduce jobs, in a scalable, redundant, distributed fashion.
- Apache Pig provides a language, Pig Latin, for expressing data flows that are compiled down into MapReduce jobs.
- Apache Hive provides an SQL-like language for querying huge datasets stored in HDFS.
There are many, many others - see for example https://hadoopecosystemtable.github.io/
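To make the MapReduce model that these tools build on a bit more concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code and uses no Hadoop API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for illustration. It only shows the three logical steps that a real cluster would spread across many machines.

```python
from collections import defaultdict


def map_phase(line):
    # Map: emit a (key, value) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    # Shuffle: group all values that share a key. On a real cluster this
    # is the step that moves data between machines.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: collapse the grouped values for one key into a single result.
    return key, sum(values)


def word_count(lines):
    # Hypothetical driver that chains the three phases in one process.
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Roughly speaking, Pig and Hive let you describe a job like this at a higher level and generate the map and reduce steps for you, while HDFS is where the input lines and the output would live.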
Spark is not only in-memory; it can perform calculations in memory if enough RAM is available, and can spill data to disk when required.
It is particularly suitable for iterative algorithms, because data from the previous iteration can remain in memory. It provides a different (and more concise) programming interface, compared to plain Hadoop. It can provide some performance advantages even when the work is done on disk rather than in memory. It supports streaming as well as batch jobs. It can be used interactively, unlike Hadoop.
Spark is relatively easy to install and play with, compared to Hadoop, so I suggest you give it a try to understand it better - for experimentation it can run off a normal filesystem and does not require HDFS to be installed. See the documentation.