About
Software Development Engineer & DA with experience in "big data" and…
Activity
-
🌍 Databricks Runtime 17.1 is now GA. This means we are a few weeks away from full Spatial SQL + Geometry and Geography data type support in DBSQL.
Liked by Holden Karau
-
OpenAI's New Open Models are now available on Databricks! Thanks to an awesome collab with #OpenAI and #Databricks engineering + Databricks Mosaic…
Liked by Holden Karau
Experience
Education
-
University of Waterloo
-
Activities and Societies: Computer Science Club, Computational Mathematics Club
Publications
-
Scaling Python with Dask: From Data Science to Machine Learning
O'Reilly
Modern systems contain multi-core CPUs and GPUs that have the potential for parallel computing. But many scientific Python tools were not designed to leverage this parallelism. With this short but thorough resource, data scientists and Python programmers will learn how the Dask open source library for parallel computing provides APIs that make it easy to parallelize PyData libraries including NumPy, pandas, and scikit-learn.
Authors Holden Karau and Mika Kimmins show you how to use Dask computations in local systems and then scale to the cloud for heavier workloads. This practical book explains why Dask is popular among industry experts and academics and is used by organizations that include Walmart, Capital One, Harvard Medical School, and NASA.
With this book, you'll learn:
What Dask is, where you can use it, and how it compares with other tools
How to use Dask for batch data parallel processing
Key distributed system concepts for working with Dask
Methods for using Dask with higher-level APIs and building blocks
How to work with integrated libraries such as scikit-learn, pandas, and PyTorch
How to use Dask with GPUs
Scaling Python with Ray
O’Reilly
Serverless computing enables developers to concentrate solely on their applications rather than worry about where they've been deployed. With the Ray general-purpose serverless implementation in Python, programmers and data scientists can hide servers, implement stateful applications, support direct communication between tasks, and access hardware accelerators.
In this book, experienced software architecture practitioners Holden Karau and Boris Lublinsky show you how to scale existing Python applications and pipelines, allowing you to stay in the Python ecosystem while reducing single points of failure and manual scheduling. Scaling Python with Ray is ideal for software architects and developers eager to explore successful case studies and learn more about decision and measurement effectiveness.
If your data processing or server application has grown beyond what a single computer can handle, this book is for you. You'll explore distributed processing (the pure Python implementation of serverless) and learn how to:
Implement stateful applications with Ray actors
Build workflow management in Ray
Use Ray as a unified system for batch and stream processing
Apply advanced data processing with Ray
Build microservices with Ray
Implement reliable Ray applications
Kubeflow for Machine Learning
O'Reilly
If you're training a machine learning model but aren't sure how to put it into production, this book will get you there. Kubeflow provides a collection of cloud native tools for different stages of a model's lifecycle, from data exploration, feature preparation, and model training to model serving. This guide helps data scientists build production-grade machine learning implementations with Kubeflow and shows data engineers how to make models scalable and reliable.
-
High Performance Spark
O'Reilly
Apache Spark is amazing when everything clicks. But if you haven’t seen the performance improvements you expected, or still don’t feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau and Rachel Warren demonstrate performance optimizations to help your Spark queries run faster and handle larger data sizes, while using fewer resources.
Ideal for software engineers, data engineers, developers, and system administrators working with large-scale data applications, this book describes techniques that can reduce data infrastructure costs and developer hours. Not only will you gain a more comprehensive understanding of Spark, you’ll also learn how to make it sing.
With this book, you’ll explore:
How Spark SQL’s new interfaces improve performance over SQL’s RDD data structure
The choice between data joins in Core Spark and Spark SQL
Techniques for getting the most out of standard RDD transformations
How to work around performance issues in Spark’s key/value pair paradigm
Writing high-performance Spark code without Scala or the JVM
How to test for functionality and performance when applying suggested improvements
Using Spark MLlib and Spark ML machine learning libraries
Spark’s Streaming components and external community packages
Learning Spark
O'Reilly
The Web is getting faster, and the data it delivers is getting bigger. How can you handle everything efficiently? This book introduces Spark, an open source cluster computing system that makes data analytics fast to run and fast to write. You’ll learn how to run programs faster, using primitives for in-memory cluster computing. With Spark, your job can load data into memory and query it repeatedly much quicker than with disk-based systems like Hadoop MapReduce.
Written by the developers of Spark, this book will have you up and running in no time. You’ll learn how to express MapReduce jobs with just a few simple lines of Spark code, instead of spending extra time and effort working with Hadoop’s raw Java API.
Quickly dive into Spark capabilities such as collect, count, reduce, and save
Use one programming paradigm instead of mixing and matching tools such as Hive, Hadoop, Mahout, and S4/Storm
Learn how to run interactive, iterative, and incremental analyses
Integrate with Scala to manipulate distributed datasets like local collections
Tackle partitioning issues, data locality, default hash partitioning, user-defined partitioners, and custom serialization
Use other languages by means of pipe() to achieve the equivalent of Hadoop streaming
Fast Data Processing With Spark
Packt
Spark is a framework for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce does but with a fast in-memory approach and a clean functional style API. With its ability to integrate with Hadoop and inbuilt tools for interactive query analysis (Shark), large-scale graph processing and analysis (Bagel), and real-time analysis (Spark Streaming), it can be interactively used to quickly process and query big data sets.
Fast Data Processing With Spark covers how to write distributed map reduce style programs with Spark. The book will guide you through every step required to write effective distributed programs from setting up your cluster and interactively exploring the API, to deploying your job to the cluster, and tuning it for your purposes.
Fast Data Processing With Spark covers everything from setting up your Spark cluster in a variety of situations (stand-alone, EC2, and so on), to how to use the interactive shell to write distributed code interactively. From there, we move on to cover how to write and deploy distributed jobs in Java, Scala, and Python.
We then examine how to use the interactive shell to quickly prototype distributed programs and explore the Spark API. We also look at how to use Hive with Spark to use a SQL-like query syntax with Shark, as well as manipulating resilient distributed datasets (RDDs).
Courses
-
Compilers
CS444
-
Real Time Operating Systems
CS452
Projects
-
Spark Testing Base
You've written an awesome program in Spark and now it's time to write some tests. Only you find yourself writing the code to set up and tear down local-mode Spark between each suite, and you say to yourself: this is not my beautiful code.
-
Sparkling Pandas
-
SparklingPandas aims to make it easy to use the distributed computing power of PySpark to scale your data analysis with pandas.
-
Fast Data Processing with Spark
-
Fast Data Processing with Spark covers how to write distributed map reduce style programs with Spark. The book will guide you through every step required to write effective distributed programs from setting up your cluster and interactively exploring the API, to deploying your job to the cluster, and tuning it for your purposes.
More activity by Holden
-
My friend Bartosz Konieczny recently reminded me that he sent a special copy of his book “Data Engineering Design Patterns”. Having done the…
Liked by Holden Karau
-
As students get ready to start classes around the world, we're making our most advanced AI tools available to college students in the US, Japan…
Liked by Holden Karau
-
If you'd like an example of leveraging `spaCy` and `GLiNER` to construct _lexical graphs_ from unstructured data sources for use in GraphRAG, etc.…
Liked by Holden Karau
-
This photo is from my naturalization ceremony about 5 years ago. On Independence Day, I can't help but reflect on what freedom truly means. Freedom…
Liked by Holden Karau
-
I joined Eventual almost 1 year ago and needless to say I'm very glad I made the jump to Eventual and to SF! It's been incredibly fulfilling and…
Liked by Holden Karau
-
GenAI has changed how organizations think about quality. How do you evaluate if a response is “good”? How do you debug multi-step prompt…
Liked by Holden Karau
-
Is there one of these for the circa 2025 LLM/agentic times? https://xmrrwallet.com/cmx.plnkd.in/g5mwdGPy Please somebody tell me, "Why yes, Bob, I believe there is!"
Liked by Holden Karau
-
💥 [Trigger Statement That Defies Expectations] I once got [rejected/fired/doubted] for [doing something unconventional]. Now? That same skill…
Liked by Holden Karau
-
I'm often asked how Apache DataFusion's Comet accelerator for Apache Spark compares to Apache Gluten + Velox. Honestly, the solutions are very…
Liked by Holden Karau