
How to Analyze Data with PandasAI

Arkzero Research · Apr 24, 2026 · 7 min read

Last updated Apr 24, 2026

PandasAI is a Python library that lets you query your data in plain English instead of writing code. You point it at a CSV file, SQL database, or pandas DataFrame, ask a question like "what was the top product by revenue last quarter?", and it generates and runs the analysis for you, returning the result or a chart. Setup requires Python 3.8 or later and an OpenAI API key, and takes about 10 minutes.

PandasAI turns your dataframes into something you can talk to. Instead of writing groupby statements or pivot table logic, you type a question in plain English and PandasAI generates Python code, executes it against your data, and returns the answer directly. It works with CSV files, SQL databases, Excel sheets, and standard pandas DataFrames. The library has accumulated over 16,500 GitHub stars and is used by data teams at companies ranging from early-stage startups to large enterprises. This guide walks through installation, a basic CSV workflow, chart generation, multi-file queries, and the most common points of failure.

What You Need Before Starting

Before installing PandasAI, confirm you have:

  • Python 3.8 or later (check with python --version)
  • pip package manager
  • An OpenAI API key with billing enabled

PandasAI also works with Anthropic Claude, Google Gemini, and local models via Ollama. OpenAI GPT-4o is the most tested option and the one used in examples below. API costs for typical analysis sessions are low. A session querying a mid-size CSV of 10,000 rows runs well under $0.05 using GPT-4o's current pricing of $2.50 per million input tokens and $10 per million output tokens.
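
The arithmetic behind that estimate can be sketched as follows. The per-token prices are the ones quoted above; the session size (5 questions, roughly 1,000 prompt tokens and 200 reply tokens each) is an illustrative assumption, not a measurement.

```python
# Prices quoted in this guide for GPT-4o, in USD per million tokens.
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00


def estimate_session_cost(questions, input_tokens_each, output_tokens_each):
    """Return the estimated USD cost of a session of identical-size calls."""
    input_cost = questions * input_tokens_each * INPUT_PRICE_PER_M / 1_000_000
    output_cost = questions * output_tokens_each * OUTPUT_PRICE_PER_M / 1_000_000
    return input_cost + output_cost


# Assumed workload: 5 questions, ~1,000 prompt tokens and ~200 reply tokens each.
cost = estimate_session_cost(questions=5, input_tokens_each=1_000, output_tokens_each=200)
print(f"Estimated session cost: ${cost:.4f}")
```

Plugging in your own token counts (most provider dashboards report them per request) gives a more realistic figure for your data.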

Installing PandasAI

Install using pip:

pip install pandasai

If you plan to generate charts, also install matplotlib:

pip install matplotlib

PandasAI version 3 changed the core API significantly. Older tutorials using PandasAI(llm) and pandas_ai.run(df, "...") are v1 patterns that no longer work with current builds. This guide uses the v3 interface throughout.

Configuring Your API Key

PandasAI reads LLM credentials from environment variables. Set your OpenAI key before running any analysis:

export OPENAI_API_KEY="sk-..."

Or set it at the top of your Python script:

import os
os.environ["OPENAI_API_KEY"] = "sk-..."

To use Anthropic Claude instead, set ANTHROPIC_API_KEY and change the model string in the config step below.
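
A missing key usually surfaces only when the first question is sent, so it can help to check for it up front. The sketch below is one way to do that with the standard library; `require_api_key` is a hypothetical helper, and `DEMO_API_KEY` is a placeholder variable used only so the snippet runs on its own.

```python
import os


def require_api_key(var_name="OPENAI_API_KEY"):
    """Return the key stored in var_name, or raise a clear error if it is unset."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set. Export it in your shell before running the script."
        )
    return key


# Demo only: a placeholder variable so this snippet is self-contained.
# In real use, call require_api_key() with the default name instead.
os.environ["DEMO_API_KEY"] = "sk-placeholder"
print("Key loaded, length:", len(require_api_key("DEMO_API_KEY")))
```

Reading the key from the environment rather than hardcoding it also keeps it out of version control.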

Analyzing a CSV File

This example uses a sales CSV with columns for date, product, region, units, and revenue. The steps apply to any tabular CSV regardless of domain.

import pandasai as pai

# point to your model
pai.config.set({
    "llm": "openai/gpt-4o"
})

# load the data
df = pai.read_csv("sales_data.csv")

# ask a question
result = df.chat("What is total revenue by region, ranked highest to lowest?")
print(result)

PandasAI generates Python code internally, executes it against the loaded dataframe, and returns the result. The dataframe stays loaded for the rest of the session, so you can ask further questions without re-reading the file:

result2 = df.chat("Which region had the largest month-over-month growth in March?")

The library sends column names, data types, and a small sample of rows to the model on each call. It does not send your full dataset to the API, which matters for privacy with large or sensitive files.
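
To make that concrete, the sketch below builds the kind of summary described above (column names, data types, and the first few rows) from a small in-memory frame. It illustrates the idea only: `describe_for_prompt` is a hypothetical helper, the rows are invented, and PandasAI's actual serialization format may differ.

```python
import pandas as pd


def describe_for_prompt(df, sample_rows=3):
    """Summarize a frame as schema plus a small sample (hypothetical helper)."""
    return {
        "columns": {name: str(dtype) for name, dtype in df.dtypes.items()},
        "sample": df.head(sample_rows).to_dict(orient="records"),
        "total_rows": len(df),
    }


# Toy stand-in for sales_data.csv with the columns used in this guide.
sales = pd.DataFrame({
    "date": ["2026-01-05", "2026-01-06", "2026-02-01", "2026-02-03"],
    "product": ["Widget", "Gadget", "Widget", "Gadget"],
    "region": ["East", "West", "East", "West"],
    "units": [10, 4, 7, 12],
    "revenue": [250.0, 180.0, 175.0, 540.0],
})

summary = describe_for_prompt(sales, sample_rows=2)
print(summary["columns"])
print(len(summary["sample"]), "of", summary["total_rows"], "rows included")
```

Note that even a small sample contains real values, so sensitive columns are worth dropping or masking before the frame is wrapped.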

Reading Excel Files

Excel files work identically to CSVs:

df = pai.read_excel("q1_report.xlsx")
result = df.chat("Summarize the key trends from this data")

For Excel workbooks with multiple sheets, load the specific sheet by name:

import pandas as pd
raw = pd.read_excel("workbook.xlsx", sheet_name="Revenue")
df = pai.DataFrame(raw)
result = df.chat("What is the average deal size by sales rep?")

The pai.DataFrame() wrapper gives any existing pandas DataFrame the .chat() interface. Use this pattern whenever you have already loaded or transformed data before handing it to PandasAI.

Connecting to a SQL Database

For database sources, load with pandas and wrap the result:

import pandas as pd
import sqlalchemy
import pandasai as pai

pai.config.set({"llm": "openai/gpt-4o"})

# connect and pull only the rows you need into a pandas DataFrame
engine = sqlalchemy.create_engine("postgresql://user:pass@host/dbname")
raw = pd.read_sql("SELECT * FROM orders WHERE created_at > '2026-01-01'", engine)

# wrap the result so it gains the .chat() interface
df = pai.DataFrame(raw)
result = df.chat("What percentage of orders were refunded, broken out by product category?")
print(result)

This pattern works with any SQLAlchemy-compatible database including PostgreSQL, MySQL, SQLite, and Snowflake (via snowflake-sqlalchemy).
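
To try the load-then-wrap pattern without a database server, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory table. The orders table, its columns, and the rows are invented for illustration. The sketch also passes the cutoff date as a query parameter rather than embedding it in the SQL string.

```python
import sqlite3

import pandas as pd

# In-memory database with a tiny, made-up orders table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER, category TEXT, refunded INTEGER, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "books", 0, "2025-12-30"),
        (2, "books", 1, "2026-01-15"),
        (3, "games", 0, "2026-02-02"),
        (4, "games", 0, "2026-03-10"),
    ],
)
conn.commit()

# Filter in SQL so only the rows you need are loaded into memory.
raw = pd.read_sql(
    "SELECT * FROM orders WHERE created_at > ?",
    conn,
    params=("2026-01-01",),
)
conn.close()

print(raw)
# From here the wrapping step is unchanged: df = pai.DataFrame(raw)
```

Filtering in the query keeps the resulting frame small, which matters more as tables grow.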

Generating Charts

Ask PandasAI to plot directly in the same prompt:

df.chat("Generate a bar chart showing monthly revenue for each region")

The chart is saved as a PNG to a charts/ folder in your working directory. To specify a different path:

pai.config.set({"save_charts_path": "/output/charts"})

Standard bar, line, pie, and scatter plots work reliably. If you need a dual-axis chart or a specific color scheme, include that detail in the prompt: "Generate a line chart with two y-axes, one for revenue and one for unit volume, with revenue in blue."

In headless environments (servers, CI pipelines), select the non-interactive Agg backend before matplotlib.pyplot is imported, so that chart code does not try to open a display:

import matplotlib
matplotlib.use('Agg')

Querying Multiple Dataframes Together

PandasAI handles cross-file questions without explicit join syntax. If you have a customers file and an orders file, pass both to pai.chat():

customers = pai.read_csv("customers.csv")
orders = pai.read_csv("orders.csv")

result = pai.chat(
    "How many orders did each customer segment place in Q1?",
    customers,
    orders
)

PandasAI infers join conditions from column names. It handles straightforward foreign keys like customer_id automatically. For ambiguous schemas, add a short description to the prompt: "customers.id matches orders.customer_id." For more than two dataframes, the same pattern extends by adding more arguments.
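
Because the join is inferred, it can be worth checking an answer against an explicit pandas merge on a small sample. The sketch below uses plain pandas frames and assumes the customers.id / orders.customer_id relationship mentioned above; the segment column and the rows are invented for illustration.

```python
import pandas as pd

customers = pd.DataFrame({
    "id": [1, 2, 3],
    "segment": ["SMB", "Enterprise", "SMB"],
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12, 13],
    "customer_id": [1, 1, 2, 3],
})

# Explicit join: every order picks up the segment of its customer.
joined = orders.merge(customers, left_on="customer_id", right_on="id", how="left")

# Count orders per segment, the same quantity the prompt asks for.
orders_per_segment = joined.groupby("segment")["order_id"].count().to_dict()
print(orders_per_segment)
```

If the numbers from the prompt and the explicit merge disagree, the inferred join key is the first thing to inspect.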

Common Failure Modes and Fixes

Incorrect aggregations on complex questions. PandasAI generates code probabilistically. For multi-step calculations, split the question into two simpler prompts and chain the results. For example, instead of asking for a period-over-period growth rate in a single prompt, first ask for totals per period, then ask for the growth calculation.
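
The two-step decomposition mirrors how you would compute the figure by hand: totals first, then the ratio. A worked sketch with invented monthly revenue numbers:

```python
# Step 1: totals per period (made-up figures standing in for the first answer).
revenue_rows = [
    ("2026-02", 1200.0),
    ("2026-02", 800.0),
    ("2026-03", 1500.0),
    ("2026-03", 1000.0),
]

totals = {}
for month, amount in revenue_rows:
    totals[month] = totals.get(month, 0.0) + amount

# Step 2: growth of the later period relative to the earlier one.
previous, current = totals["2026-02"], totals["2026-03"]
growth_pct = (current - previous) / previous * 100

print(totals)
print(f"Month-over-month growth: {growth_pct:.1f}%")
```

Checking the intermediate totals before asking for the growth rate makes it easier to see which step went wrong when an answer looks off.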

API rate limit errors. If you loop through many questions programmatically, add time.sleep(1) between calls to stay within OpenAI's requests-per-minute limits.
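
A fixed pause is the simplest option. If you still hit limits, retrying with exponential backoff is a common alternative. The sketch below is generic: `call_with_backoff` is a hypothetical helper, `ask` stands in for whatever function issues your df.chat() call, and it retries on any exception because the exact exception type raised for rate limits depends on your setup.

```python
import time


def call_with_backoff(ask, question, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call ask(question), waiting 1x, 2x, 4x... base_delay between failed attempts."""
    for attempt in range(max_attempts):
        try:
            return ask(question)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo with a stand-in function that fails twice, then succeeds.
calls = {"count": 0}


def flaky_ask(question):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("rate limited (simulated)")
    return f"answer to: {question}"


# Record the waits instead of actually sleeping so the demo runs instantly.
waits = []
result = call_with_backoff(flaky_ask, "total revenue?", sleep=waits.append)
print(result, "| waits:", waits)
```

In real use you would leave `sleep` at its default and narrow the `except` clause to the rate-limit error your client raises.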

Column not found errors. PandasAI reads column names from the dataframe schema. If your CSV headers mix casing or contain spaces, normalize them before wrapping the data:

import pandas as pd
import pandasai as pai

raw = pd.read_csv("data.csv")
# lowercase every header and replace spaces with underscores
raw.columns = raw.columns.str.lower().str.replace(" ", "_")
df = pai.DataFrame(raw)
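
If headers also contain stray whitespace or punctuation, a slightly stricter rule may help. The helper below is a hypothetical sketch in plain Python, not part of PandasAI. It lowercases, trims, and collapses any run of characters other than a-z and 0-9 into a single underscore, so accented or non-Latin letters are replaced as well.

```python
import re


def normalize_column(name):
    """Lowercase a header and replace runs of non-alphanumerics with '_'."""
    cleaned = re.sub(r"[^0-9a-z]+", "_", name.strip().lower())
    return cleaned.strip("_")


headers = ["Order Date ", "Unit-Price ($)", "REGION", "customer  id"]
normalized = [normalize_column(h) for h in headers]
print(normalized)
```

On a pandas frame the same rule can be applied with raw.columns = [normalize_column(c) for c in raw.columns].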

Chart not saved. Confirm matplotlib is installed (pip install matplotlib) and the output directory is writable.
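
A quick way to rule out the directory as the cause is to create it and write a throwaway file. This is a minimal sketch: `ensure_writable_dir` is a hypothetical helper, and the demo uses a temporary directory in place of your real charts path.

```python
import tempfile
from pathlib import Path


def ensure_writable_dir(path):
    """Create path if needed and confirm a file can be written inside it."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    probe = directory / ".write_test"
    probe.write_text("ok")  # raises PermissionError/OSError if not writable
    probe.unlink()
    return directory


# Demo against a temporary location instead of a real output folder.
with tempfile.TemporaryDirectory() as tmp:
    charts_dir = ensure_writable_dir(Path(tmp) / "charts")
    created = charts_dir.is_dir()
    print("Writable:", charts_dir.name)
```

If the helper raises, the problem is with the path or its permissions and not with PandasAI or matplotlib.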

Reducing API Costs

Each df.chat() call sends the column schema and a data sample. For large files, load only the columns your questions will reference:

raw = pd.read_csv("large_file.csv", usecols=["date", "revenue", "region", "product"])

For fully offline use, PandasAI supports local models via Ollama. Replace the config line with:

pai.config.set({"llm": "ollama/llama3"})

This eliminates API costs entirely. Response quality and code generation accuracy are lower than GPT-4o for complex analytical questions, but local models handle straightforward aggregations and summaries well.

When Setup Is a Barrier

The steps above assume you are comfortable opening a terminal, installing Python packages, and handling environment variables. If that is not you, or you want to go straight to analysis without any configuration, VSLZ lets you upload a CSV and start asking questions immediately with no local setup required.

Summary

PandasAI v3 uses pai.read_csv(), pai.DataFrame(), and df.chat() as its core interface. It supports CSV, Excel, SQL databases, and multi-file joins. Charts generate to PNG automatically with matplotlib installed. Typical sessions cost under $0.05 using GPT-4o. For complex aggregations, decompose questions into smaller steps for more reliable results. For offline or zero-cost use, swap to a local Ollama model.

FAQ

Does PandasAI send my full dataset to OpenAI?

No. PandasAI sends column names, data types, and a small sample of rows to the language model, not the full dataset. The generated Python code runs locally on your machine against the complete data. For highly sensitive data, you can also run PandasAI with a local model through Ollama, keeping all data entirely offline.

Which LLMs does PandasAI support?

PandasAI supports OpenAI models (GPT-4o, GPT-4, GPT-3.5), Anthropic Claude, Google Gemini, and local models through Ollama and LiteLLM. You switch models by changing the string in pai.config.set(). GPT-4o is the most widely tested and produces the most reliable code for analytical queries.

What is the difference between PandasAI v1 and v3?

In PandasAI v1, you instantiated a PandasAI object with an LLM and called pandas_ai.run(df, question). In v3, you load data with pai.read_csv() or pai.DataFrame() and call df.chat(question) directly. The v3 API also adds native multi-dataframe support via pai.chat(question, df1, df2) and a centralized config system. Most tutorials published before 2025 use the v1 pattern.

How much does it cost to use PandasAI with OpenAI?

Costs depend on question complexity and dataset size. A typical analysis session with 10 to 20 questions on a CSV of 10,000 rows uses well under $0.10 of GPT-4o tokens at current OpenAI pricing ($2.50 per million input tokens, $10 per million output tokens as of 2026). PandasAI samples your data rather than sending the full file, which keeps token usage low.

Can PandasAI generate charts automatically?

Yes. Ask for a chart in the prompt (for example, 'Generate a bar chart showing revenue by region') and PandasAI generates Python code that produces and saves a PNG to a charts/ directory in your working directory. Matplotlib must be installed separately with pip install matplotlib. You can configure a different output path via pai.config.set({'save_charts_path': '/your/path'}).
