Skip to main content

spark-Pyspark

 spark :


in-memery computation
spark : Data extracting from hard disk and store into ram. and each step operation are store in ram and only after completion of job it write to hard disk.

hadoop mapreduce : it perform some operation and store to hard drive every time for each step. it read and write from hard disk every step of the job. so latency is high.

Lazy execution :

when applying function to read the data it does not read because we do not performing any opearaiton. it does not read data until we perform some operation or computation. suppose we do in pandas pd.read_csv it read data store in ram.

Parallel Processing :

distributed the data into different cluster and stored in nodes.

batch processing and real-time processing ex. credit card transaction. genuine or fake





Popular posts from this blog

cammand for installing library in python

 Command for installing in jupyter notebook:                pip install library_name                ex. pip install nump installing from anaconda prompt:           1. pip install numpy           2.   conda install -c conda-forge matplotlib search for conda command for matplotlib and go to official website. Installing from anaconda navigator easy. Somtime give error then open as administrator

spark-scala-python

 ############sparkcontest######33333 it is used in earlier spark 1.x //scala  import org.apache.spark.SparkConf     import org.apache.spark.SparkContext     val conf = new SparkConf().setAppName("first").setMaster("local[*]")     val sc = new SparkContext(conf) val rdd1 = sc.textFile("C:/workspace/data/txns") # python  from pyspark import SparkContext,SparkConf     conf = SparkConf().setAppName("first").setMaster("local[*])     sc = SparkContext(conf)      ## now days sparksession are used  ########range######### // in Scala val myRange = spark.range(1000).toDF("number") # in Python myRange = spark.range(1000).toDF("number") ###########where########## // in Scala val divisBy2 = myRange.where("number % 2 = 0") # in Python divisBy2 = myRange.where("number % 2 = 0") ###########read csv ########## // in Scala val flightData2015 = spark .read .option("inferSchema", "true") .o...

Bagging and Boosting

  What is an Ensemble Method? The ensemble is a method used in the machine learning algorithm. In this method, multiple models or ‘weak learners’ are trained to rectify the same problem and integrated to gain desired results. Weak models combined rightly give accurate models. Bagging Bagging is an acronym for ‘Bootstrap Aggregation’ and is used to decrease the variance in the prediction model. Bagging is a parallel method that fits different, considered learners independently from each other, making it possible to train them simultaneously. Bagging generates additional data for training from the dataset. This is achieved by random sampling with replacement from the original dataset. Sampling with replacement may repeat some observations in each new training data set. Every element in Bagging is equally probable for appearing in a new dataset.  These multi datasets are used to train multiple models in parallel. The average of all the predictions from different ensemble models i...