Sai Parvathaneni | Spark Optimization Techniques: Repartition() and Coalesce() | Understanding Data Skewness in Apache Spark | Aug 27, 2024
Sai Parvathaneni | Spark Optimization Techniques: groupByKey() and reduceByKey() | Understanding Shuffle in Apache Spark | Aug 26, 2024
Sai Parvathaneni | Spark Optimization Techniques: Broadcast() Join | A Broadcast Join in Apache Spark is an optimization technique that is used to improve the performance of joins involving a large dataset… | Aug 28, 2024
Sai Parvathaneni | Spark Optimization Techniques: Cache() and Persist() | When working with Apache Spark, optimizing the performance of your Spark jobs is crucial, especially when dealing with large datasets and… | Aug 30, 2024
Sai Parvathaneni | Spark Optimization Techniques: Types of Joins — Visual Representation | Joining datasets is a common operation in data processing, and Apache Spark provides several ways to perform joins efficiently. However… | Sep 1, 2024
Sai Parvathaneni | Mastering Spark Query Plans: Narrow Transformations | In this article, we will explore narrow transformations in Apache Spark. We will go through the query plan produced by Spark and explain… | Sep 4, 2024
Sai Parvathaneni | Mastering Spark Query Plans: Wide Transformations — Repartition and Coalesce | In Spark, transformations can be categorized into narrow and wide transformations. While narrow transformations operate within the same… | Sep 6, 2024
Sai Parvathaneni | Mastering Spark Query Plans: Wide Transformations — Joins | In Spark, joins are considered wide transformations because they require shuffling of data across the cluster. When you join two… | Sep 8, 2024
Sai Parvathaneni | Mastering Spark Query Plans: Wide Transformations — GroupBy Aggregations | In Spark, groupBy transformations are wide operations because they involve shuffling data across the cluster to group similar keys… | Sep 10, 2024
Sai Parvathaneni | Spark Optimization Techniques: The Role of Serialization | In distributed systems like Apache Spark, performance optimization is crucial for ensuring that data processing jobs are efficient… | Sep 10, 2024
In Towards Dev by Sai Parvathaneni | Building a Streaming Data Pipeline: Spark vs. Flink Comparison with Kafka Integration | When handling streams of data, two prominent frameworks that often come into play are Apache Spark and Apache Flink. Both are commonly used… | Sep 13, 2024
Sai Parvathaneni | Spark Optimization Techniques: Predicate Pushdown | Apache Spark is a powerful tool for processing massive datasets, and part of what makes it so effective is its ability to scale and perform… | Sep 15, 2024
In Towards Dev by Sai Parvathaneni | Apache Spark for Dummies: Part 1 Architecture and RDDs | This series will be your ultimate guide to Apache Spark, I promise. | May 19, 2023
Sai Parvathaneni | Short Reads: SparkContext and SparkSession: Pizza Analogy | Let’s consider SparkContext and SparkSession. Just as you would handle pizza, from making the dough, adding toppings, to finally baking it… | Jun 2, 2023
In Towards Dev by Sai Parvathaneni | Apache Spark for Dummies Part 2: DataFrames, Datasets, and Spark SQL | Welcome back to our exploration into Apache Spark! In Part 1 of our series, we delved into the basics of Spark and got our hands dirty with… | May 29, 2023
In Towards Dev by Sai Parvathaneni | Building a Real-time Log Monitoring System with Kafka and Spark Streaming | By leveraging Kafka for data streaming and integration, and Spark for real-time data processing and analysis, organizations can achieve a… | Jun 6, 2023
Sai Parvathaneni | Apache Spark for Dummies Part 3: Data Processing and Analysis | Welcome back to the third part of our series on Apache Spark. We’ve discussed Apache Spark’s architecture, RDDs, DataFrames, and Datasets… | Jun 11, 2023
Sai Parvathaneni | Apache Spark for Dummies: Part 4 — Advanced Spark Features | Welcome to the fourth part of our Apache Spark series. In this segment, we will explore some of the advanced features of Apache Spark that… | Jun 17, 2023