Guides

How to Query S3 Files with Amazon Athena

Arkzero ResearchApr 25, 20267 min read

Last updated Apr 25, 2026

Amazon Athena lets you run SQL directly on files stored in S3 with no database to set up or maintain. You point Athena at a bucket, define a table schema in the AWS Glue Data Catalog, and start querying CSV, JSON, or Parquet files within minutes. Athena charges $5 per terabyte scanned, so converting raw CSV to Parquet before querying can cut costs by 60 to 80 percent on the same workload.
AWS S3 and Amazon Athena serverless SQL query service

Amazon Athena is a serverless query service from AWS that runs standard SQL against files sitting in S3. There is no cluster to spin up, no warehouse to configure, and no data to move. You point Athena at a bucket, define a table, and run SELECT statements. Results appear in seconds to minutes depending on data volume, and you pay only for the bytes scanned.

Athena is particularly useful for teams that already store operational exports in S3 -- sales data from a CRM, event logs from an app, financial exports from accounting tools -- and want to analyze them without moving the data to a separate database.

What You Need Before Starting

You need an active AWS account with permission to access S3 and Athena, and at least one S3 bucket containing data you want to query. Supported formats include CSV, TSV, JSON, Parquet, ORC, and Avro. A second S3 bucket or prefix for storing query results is also required. If your data is already in S3, the full setup takes about 15 minutes.

Step 1: Create a Results Bucket

Athena writes every query result to S3. Before opening the Athena console, create a dedicated bucket or prefix for those results. In the S3 console, create a bucket named something like my-athena-results. Keep it in the same AWS region you plan to use for Athena. Cross-region traffic adds latency and cost.

Step 2: Configure the Query Results Location

Open the Athena console and navigate to Settings > Manage. Set the Query result location to s3://my-athena-results/ and click Save. If you skip this step, every query fails with a results location error. The setting applies per workgroup, so configure it once per workgroup.

Consider setting a query result retention policy in the same screen. Seven days is a reasonable default. Old results accumulate silently and add storage cost over time.

Step 3: Create a Database

Athena uses the AWS Glue Data Catalog to store table definitions. Run this in the Athena query editor to create a database namespace:

CREATE DATABASE my_data;

All tables you create live inside this database. If you work with multiple data sources or teams, use separate databases to keep schemas organized.

Step 4: Define a Table

This is where Athena differs from a traditional database. You are not loading data into anything -- you are pointing Athena at data that already lives in S3. The table definition maps your S3 files to a SQL schema.

Here is a working example for a CSV file with headers:

CREATE EXTERNAL TABLE my_data.sales_2025 (
  order_id STRING,
  customer_id STRING,
  product_name STRING,
  revenue DOUBLE,
  order_date STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"',
  'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://my-data-bucket/sales/'
TBLPROPERTIES ('skip.header.line.count'='1');

Two things to get right. First, all column names must be lowercase. Athena rejects tables with uppercase column names when interacting with the Glue Data Catalog. Second, use EXTERNAL TABLE rather than CREATE TABLE. An external table does not own the data -- dropping the table removes only the schema definition, not the files in S3.

Once the table is created, test it:

SELECT * FROM my_data.sales_2025 LIMIT 10;

If rows return, your table is working.

Step 5: Run Useful Queries

Athena supports standard SQL including aggregations, joins, window functions, and subqueries. A few practical starting points:

Total revenue by product:

SELECT product_name, SUM(revenue) AS total_revenue
FROM my_data.sales_2025
GROUP BY product_name
ORDER BY total_revenue DESC;

Orders per day:

SELECT order_date, COUNT(*) AS order_count
FROM my_data.sales_2025
GROUP BY order_date
ORDER BY order_date;

Join two S3 datasets:

SELECT s.order_id, s.revenue, c.customer_name
FROM my_data.sales_2025 s
JOIN my_data.customers c ON s.customer_id = c.customer_id
WHERE s.revenue > 500;

Each query scans the full file set in the table location unless you partition your data or convert to a columnar format.

Cutting Costs with Parquet

Athena charges $5 per terabyte of data scanned. A single query against a 10 GB CSV file costs roughly $0.05. That sounds negligible, but 50 analysts running 20 queries per day on a 100 GB dataset generates about 100 TB of scanning per month -- a $500 monthly bill from query costs alone.

Converting that same data to Parquet changes the math. Parquet is a columnar format, so Athena reads only the columns your query actually references rather than scanning every row. A query referencing 3 of 40 columns reads about 7.5 percent of the data volume. On a 100 GB dataset queried 1,000 times per day, that drops monthly query costs from around $500 to roughly $40.

You can convert CSV to Parquet locally with a short Python script:

import pandas as pd
df = pd.read_csv('sales_2025.csv')
df.to_parquet('sales_2025.parquet', index=False)

Upload the Parquet file to S3, then create a new table using STORED AS PARQUET instead of TEXTFILE. Query performance also improves because Parquet metadata lets Athena skip entire row groups that do not match your filter conditions.

Using Partitions

Partitioning splits your data into S3 subdirectories by column value, typically date or region. When a query filters on the partition column, Athena scans only the matching directories and ignores everything else.

A partitioned folder structure looks like:

s3://my-data-bucket/sales/year=2025/month=01/
s3://my-data-bucket/sales/year=2025/month=02/
s3://my-data-bucket/sales/year=2026/month=04/

A query filtered to year=2026 AND month=04 scans only that single directory. For large historical datasets, date partitioning reduces query scan volume by 90 percent or more on typical date-range filters.

Workgroups for Cost Control

Athena workgroups let you enforce data scan limits per query and separate billing by team or project. In the Athena console, navigate to Workgroups and create one with a per-query scan limit set to 1 GB. This prevents a single ad hoc query from accidentally scanning an entire multi-terabyte dataset.

Each workgroup can have its own results bucket, which keeps query history separated when multiple teams share the same AWS account.

Common Setup Issues

Three problems account for most early Athena failures.

Table not found after creation. If Athena returns a "table does not exist" error after you created the table, check that you selected the correct database in the dropdown on the left side of the query editor. Athena does not switch databases automatically between sessions.

Query results location not set. Every new workgroup requires a results location. The error reads "Query result location is required." Go to Workgroup settings and set the S3 output path before running any queries.

Uppercase column names. If your DDL contains a column named OrderID instead of order_id, Athena creates the table but queries against Glue-backed schemas fail with a metadata mismatch. Rename all columns to lowercase before creating your table.

What Athena Does Not Replace

Athena is a query layer, not a data warehouse. It handles exploratory analysis, ad hoc queries, and scheduled reports on relatively static S3 data well. It is not suited for frequent writes, transactional workloads, or sub-second dashboard response times. For those use cases, a warehouse like Redshift or BigQuery is a better fit.

If you want to skip the S3 and Glue setup entirely and go straight from a file upload to SQL-backed insights, VSLZ lets you get there from a single prompt with no infrastructure required.

FAQ

How much does Amazon Athena cost?

Athena charges $5 per terabyte of data scanned by your queries. There is no charge for DDL statements, failed queries, or queries against metadata. The practical cost depends heavily on data format: CSV files get scanned in full, while Parquet files with selective column queries can reduce scan volume by 80 to 90 percent. Many teams find their Athena bill is negligible once they partition data and convert to columnar formats.

Can Amazon Athena query CSV files directly from S3?

Yes. Athena supports querying CSV, TSV, JSON, Parquet, ORC, and Avro files stored in S3. CSV is the most common starting point. You define an EXTERNAL TABLE using the OpenCSVSerde serializer, point the LOCATION at your S3 prefix, and set skip.header.line.count to 1 if your file has a header row. No data migration or transformation is required to start querying.

What is the difference between Amazon Athena and Redshift?

Athena is serverless and queries files directly in S3 with no setup. You pay per query by data volume scanned. Redshift is a managed data warehouse where you load data into structured tables, pay for cluster uptime, and get faster query performance on large, complex workloads. Athena is better for ad hoc analysis of S3 data. Redshift is better for production dashboards and frequent, complex analytical queries where consistent performance matters.

How do I reduce Amazon Athena query costs?

Three approaches have the largest impact. First, convert CSV files to Parquet format before querying -- Parquet reads only the columns your query references, typically reducing scan volume by 70 to 90 percent. Second, partition your data by date or another high-selectivity column so queries only scan relevant directories. Third, use Athena workgroups to set per-query scan limits, which prevents expensive runaway queries. Combining all three can cut a $500 monthly query bill to under $50 on the same workload.

Do I need to know SQL to use Amazon Athena?

Athena uses standard ANSI SQL (Presto SQL dialect), so basic SQL knowledge is required to write queries. You need to know SELECT, WHERE, GROUP BY, and JOIN at minimum. The AWS Athena console includes a query history and table preview that help beginners get started. For teams that want to analyze data without writing SQL, tools that layer a natural language interface on top of Athena -- or provide a fully managed query layer -- are a practical alternative.

Related

OpenMetadata data catalog interface showing database schema discovery
Guides

How to Set Up OpenMetadata for Data Discovery

OpenMetadata is an open-source data catalog that gives teams a single place to discover, document, and govern their data assets. Setting it up takes under 30 minutes using Docker: spin up the containers, log into the UI at localhost:8585, then connect your first data source using one of 90+ pre-built connectors. Once ingestion runs, every table, column, and owner is searchable and lineage-linked across your entire stack.

Arkzero Research · Apr 29, 2026
Streamlit logo on a clean white background
Guides

How to Build a Data Dashboard with Streamlit

Streamlit is an open-source Python library that turns a script into a shareable web dashboard without any front-end code. Install it with pip, write a Python file that loads your CSV with pandas, add sidebar widgets for filtering, and render interactive charts with Plotly. Push the file to GitHub, connect it to Streamlit Community Cloud, and anyone with the URL can view live results. No server configuration required.

Arkzero Research · Apr 29, 2026
Airbyte Cloud data integration platform
Guides

How to Set Up Airbyte Cloud for Data Syncing

Airbyte Cloud is a managed data integration platform that syncs data from SaaS tools, databases, and APIs into a central warehouse without requiring Docker, infrastructure, or engineering resources. A free 30-day trial lets you connect sources like Salesforce, HubSpot, Stripe, or Google Sheets to destinations like BigQuery, Snowflake, or Postgres in minutes. This guide walks through the full setup from account creation to your first automated sync.

Arkzero Research · Apr 29, 2026