
spark-Pyspark

Spark :


In-memory computation :

Spark : it reads data from the hard disk into RAM, keeps the intermediate result of each step in RAM where possible, and writes to the hard disk only after the job completes.

Hadoop MapReduce : it performs an operation and stores the result on the hard disk after every step, then reads it back for the next step. Because it reads from and writes to the hard disk at every step of the job, latency is high.
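The difference between the two approaches can be sketched in plain Python. This is only an illustration, not Spark or Hadoop code: the step functions `double` and `keep_large`, the JSON files, and the `disk_writes` counter are all made up for the example.

```python
import json
import os
import tempfile

# Two hypothetical processing steps applied one after the other.
def double(values):
    return [v * 2 for v in values]

def keep_large(values):
    return [v for v in values if v > 4]

STEPS = [double, keep_large]

def run_in_memory(values):
    """Spark-like: intermediate results stay in RAM between steps."""
    for step in STEPS:
        values = step(values)
    return values

def run_with_disk(values, workdir):
    """MapReduce-like: every step writes its result to disk and reads it back."""
    disk_writes = 0
    for i, step in enumerate(STEPS):
        path = os.path.join(workdir, f"step_{i}.json")
        with open(path, "w") as f:
            json.dump(step(values), f)
        disk_writes += 1
        with open(path) as f:
            values = json.load(f)
    return values, disk_writes

data = [1, 2, 3, 4]
in_memory_result = run_in_memory(data)
with tempfile.TemporaryDirectory() as workdir:
    disk_result, disk_writes = run_with_disk(data, workdir)

print(in_memory_result, disk_result, disk_writes)
```

Both versions give the same answer. The second one pays for a disk write and a disk read at every step, which is where the extra latency comes from.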

Lazy execution :

When we call a function to read data, Spark does not read it yet, because no operation has been performed on it. The data is not read until we perform some operation or computation that needs a result (an action). In pandas, by contrast, pd.read_csv reads the data immediately and stores it in RAM.
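The idea can be shown with a small plain-Python sketch. `LazyDataset`, `load_numbers` and `load_calls` are hypothetical names invented for this illustration and are not part of the PySpark API. The class only records each step, and runs everything when `collect()` is called.

```python
class LazyDataset:
    """Toy lazy pipeline: map/filter only record the step, collect() runs it."""

    def __init__(self, source, steps=()):
        self._source = source        # zero-argument function that loads the data
        self._steps = tuple(steps)   # recorded (kind, function) pairs

    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    def collect(self):
        # The "action": only now is the source read and the steps applied.
        rows = self._source()
        for kind, fn in self._steps:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows


load_calls = []  # records every time the data source is actually read

def load_numbers():
    load_calls.append(1)
    return [1, 2, 3, 4, 5]

ds = LazyDataset(load_numbers).map(lambda x: x * 10).filter(lambda x: x > 20)
reads_before_action = len(load_calls)   # nothing has been read yet

result = ds.collect()                   # reading and computing happen here
reads_after_action = len(load_calls)

print(reads_before_action, result, reads_after_action)
```

Building the pipeline does not touch the data source at all. The read happens once, when the result is requested.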

Parallel Processing :

The data is split into partitions and distributed across the nodes of a cluster, so each node works on its own part of the data in parallel.
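A rough sketch of this idea on one machine, using only the Python standard library: worker threads stand in for cluster nodes, and `split_into_partitions` and `partial_sum` are hypothetical helpers written for the example. A real Spark cluster handles the splitting and scheduling itself.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(rows, num_partitions):
    """Split the data into roughly equal parts, one per worker."""
    return [rows[i::num_partitions] for i in range(num_partitions)]

def partial_sum(partition):
    """Work done independently on one partition."""
    return sum(partition)

rows = list(range(1, 101))
partitions = split_into_partitions(rows, 4)

# Each worker processes its own partition; the partial results are combined at the end.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(partial_sum, partitions))

total = sum(partial_results)
print(len(partitions), total)
```

Each partition is processed independently, and the small partial results are combined into the final answer afterwards.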

Spark supports both batch processing and real-time processing, e.g. checking whether a credit card transaction is genuine or fake.




