Hi,
I have a text file that needs parsing and I have absolutely no idea how I'm going to do it.
The data looks like this:
[code]
JAN Year FEB Year MAR Year APR Year MAY Year JUN Year JUL Year AUG Year SEP Year OCT Year NOV Year DEC Year WIN Year SPR Year SUM Year AUT Year ANN Year
267.8 1928 242.0 2002 218.3 1981 146.3 1920 162.4 1967 205.4 2012 196.0 1939 260.9 1956 287.1 1918 280.9 1967 306.6 2009 258.5 1993 602.5 1995 455.8 1920 488.9 2012 667.8 1954 1689.3 1954
[/code]
Could anyone give me any pointers? No pun intended.
You start with a record for the data, whatever you want to store, like
[code]record Data
{
    Int Year;
    String Month; //maybe Int
    Int Number; //I have no idea what the number is for
}[/code]
I chose an integer for the number instead of a floating point datatype, because it avoids issues related to imprecise representation and rounding; you can just multiply/divide by 10 for input/output.
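A rough sketch of that integer idea in Python, assuming every value is non-negative and has at most one digit after the decimal point (true for the sample row, but only an assumption about the rest of the file). The helper names `to_tenths` and `from_tenths` are made up for illustration.

```python
def to_tenths(text):
    """Convert a string like '267.8' to the integer 2678 without using floats.

    Assumes a non-negative number with at most one digit after the decimal point.
    """
    whole, _, frac = text.partition('.')
    if len(frac) > 1:
        raise ValueError('expected at most one decimal digit: %r' % text)
    return int(whole) * 10 + int(frac or '0')


def from_tenths(tenths):
    """Format an integer number of tenths back into a string like '267.8'."""
    return '%d.%d' % divmod(tenths, 10)


print(to_tenths('267.8'))   # 2678
print(from_tenths(2678))    # 267.8
```

Going through the string rather than through `float` means the stored value never picks up a binary rounding error.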
To store these get a vector (dynamic array), or if it's always going to be 17 sets, then an array of 17 elements.
Then you either loop through each element (in the 17 elements case) or read until newline (\n).
You can just read the string for the month (and possibly convert it to an integer) and skip the "Year" string, although you might want to ensure that the string is indeed present in the file.
When this is done, you proceed to read the numbers and fill in the rest for the partially set records. You should probably ensure that you read exactly as many value/year pairs as expected (which is the number of elements in the vector/array from the first loop).
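The steps above could be sketched in Python roughly like this. It is only one possible reading of the approach: `Record` and `parse_two_lines` are hypothetical names, the input is passed in as two strings rather than read from a file, and the value is kept as a float for brevity instead of the integer representation suggested earlier.

```python
from collections import namedtuple

# Hypothetical record type mirroring the pseudocode record above.
Record = namedtuple('Record', ['year', 'period', 'value'])


def parse_two_lines(header_line, value_line):
    """Build one Record per period from a header line and a value line."""
    header = header_line.split()
    # The header alternates between a period name and the literal word "Year".
    periods = header[0::2]
    if len(header) % 2 or any(word != 'Year' for word in header[1::2]):
        raise ValueError('header does not alternate NAME / Year')

    numbers = value_line.split()
    # Expect exactly one (value, year) pair per period found in the header.
    if len(numbers) != 2 * len(periods):
        raise ValueError('expected %d numbers, got %d'
                         % (2 * len(periods), len(numbers)))

    return [Record(int(numbers[2 * i + 1]), period, float(numbers[2 * i]))
            for i, period in enumerate(periods)]


records = parse_two_lines('JAN Year FEB Year MAR Year',
                          '267.8 1928 242.0 2002 218.3 1981')
for record in records:
    print(record.year, record.period, record.value)
```

The two checks correspond to the two "you might want to ensure" remarks: the header really contains "Year" where expected, and the number of values matches the number of periods.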
Would be really easy for me to do in Python, what language are you using?
[code]
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>
using namespace std;

typedef pair<double, int> value_year;

int main() {
    int n_lines = 1;
    // Read first line and store each month in a lookup vector
    vector<string> months;
    for (int i = 0; i < 17; i++) {
        string month;
        cin >> month;
        months.push_back(month);
        cin >> month; // Skip the "Year" label
    }
    // in_data["MONTH"][i].first = value
    // in_data["MONTH"][i].second = year
    map<string, vector<value_year> > in_data;
    // Read next lines
    for (int i = 0; i < n_lines; i++) {
        for (int j = 0; j < 17; j++) {
            double value;
            int year;
            cin >> value >> year;
            in_data[months[j]].push_back(value_year(value, year));
        }
    }
    cout << in_data["SPR"][0].first << " " << in_data["SPR"][0].second << endl;
    return 0;
}
[/code]
I'm pretty sure this can be done in like 2 lines in Python. hue
[QUOTE=adnzzzzZ;40395990]-code-
I'm pretty sure this can be done in like 2 lines in Python. hue[/QUOTE]
In fact...
[code]>>> l0, l1 = (l.strip().split() for l in open('data.txt', 'r').readlines())
>>> print '\n'.join('%s %s: %s' % d for d in zip(l1[1::2], l0[::2], l1[::2]))
1928 JAN: 267.8
2002 FEB: 242.0
1981 MAR: 218.3
1920 APR: 146.3
1967 MAY: 162.4
2012 JUN: 205.4
1939 JUL: 196.0
1956 AUG: 260.9
1918 SEP: 287.1
1967 OCT: 280.9
2009 NOV: 306.6
1993 DEC: 258.5
1995 WIN: 602.5
1920 SPR: 455.8
2012 SUM: 488.9
1954 AUT: 667.8
1954 ANN: 1689.3[/code]
[editline]24th April 2013[/editline]
Actually, this is a little better:
[code]>>> lsplit = [l.strip().split() for l in open('data.txt', 'r').readlines()]
>>> data = zip(map(int, lsplit[1][1::2]), lsplit[0][::2], map(float, lsplit[1][::2]))
>>> print '\n'.join('%d %s: %.1f' % d for d in data)
1928 JAN: 267.8
2002 FEB: 242.0
1981 MAR: 218.3
1920 APR: 146.3
1967 MAY: 162.4
2012 JUN: 205.4
1939 JUL: 196.0
1956 AUG: 260.9
1918 SEP: 287.1
1967 OCT: 280.9
2009 NOV: 306.6
1993 DEC: 258.5
1995 WIN: 602.5
1920 SPR: 455.8
2012 SUM: 488.9
1954 AUT: 667.8
1954 ANN: 1689.3[/code]
[I]Edit[/I]: Fixed redundant parameter packing/unpacking.
[QUOTE=Anonim;40396026][code]>>> data = zip(*[map(int, lsplit[1][1::2]), lsplit[0][::2], map(float, lsplit[1][::2])])
[/code][/QUOTE]
Is there any purpose for putting the data into a list and immediately unpacking it?
[QUOTE=sixtyten;40397833]Is there any purpose for putting the data into a list and immediately unpacking it?[/QUOTE]
Oops. Forgot to pay attention after an early attempt.
Fixed now.
[QUOTE=Shadaez;40395416]Would be really easy for me to do in Python, what language are you using?[/QUOTE]
Java/Processing
I can probably use python though.
That's amazing and exactly what I need, would you mind breaking down your code if that's not too much to ask?
[code]open('data.txt', 'r').readlines()[/code]
open data.txt in read mode and return a list of strings, one per line (each still ending in its newline)
[code]for l in open(...).readlines()[/code]
do stuff for each string l in the iterator
[code]l.strip().split()[/code]
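strip leading and trailing whitespace from the string, then split it at runs of whitespace (split() with no arguments already ignores leading and trailing whitespace, so the strip() is harmless but not strictly needed)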
lsplit is then essentially your file represented as a 2D array, where each element is a line and the line is an array where each element is a word.
[code]map(int, lsplit[1][1::2])[/code]
from the second entry in lsplit (the second line of the file), take every second element starting with the second one, and convert each to an int. Python lists are 0-based, so the 1 in 1::2 means "start at the second element" and the 2 means "take every second element"
so essentially this picks out every year and converts it to an int
[code]lsplit[0][::2][/code]
simply get every second element from the first entry of lsplit (the first line), starting with the first one, so essentially this picks out every "month"
[code]map(float, lsplit[1][::2])[/code]
like the first map(...), only converting to a float and starting with the first element instead of the second
this picks out the other number
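To see the slicing on its own, here is a small sketch using the first three value/year pairs from the sample row. The `list(...)` around `map` is only there so that it also prints the contents under Python 3.

```python
row = ['267.8', '1928', '242.0', '2002', '218.3', '1981']

values = row[::2]    # start at index 0, take every second element
years = row[1::2]    # start at index 1, take every second element

print(values)                  # ['267.8', '242.0', '218.3']
print(years)                   # ['1928', '2002', '1981']
print(list(map(int, years)))   # [1928, 2002, 1981]
```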
[code]zip(...)[/code]
This puts the three arrays together. An example probably shows this best:
[code]>>> zip([1,2,3], ['a','b','c'], [-1,-2,-3])
[(1, 'a', -1), (2, 'b', -2), (3, 'c', -3)][/code]
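(in python3, zip returns a lazy iterator, so the interpreter prints something like <zip object at 0x...> instead of the contents; wrap it in list(...) to see the tuples. python2 returns a list and prints the contents directly)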
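For anyone on Python 3, a minimal sketch of the same example. It assumes Python 3 behaviour, where `zip` is lazy and can only be walked through once.

```python
pairs = zip([1, 2, 3], ['a', 'b', 'c'], [-1, -2, -3])
as_list = list(pairs)   # force the lazy zip object into a list
print(as_list)          # [(1, 'a', -1), (2, 'b', -2), (3, 'c', -3)]

# A zip object can only be consumed once: a second pass yields nothing.
print(list(pairs))      # []
```

So if the combined data is needed more than once, it is worth storing the result of `list(zip(...))` rather than the zip object itself.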
[code]'\n'.join(...)[/code]
join the elements, separated by newlines
[code]'%d %s: %.1f' % ...[/code]
create a formatted string (... is the input): %d is for an integer, %s for a string, and %.1f for a floating-point number with one digit after the decimal point
[code]d for d in data[/code]
you can probably guess this, but basically for each entry in the data array
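The formatting and joining steps can be tried in isolation like this. The two tuples are taken from the output shown earlier in the thread, and the variable names `lines` and `report` are just for illustration.

```python
data = [(1928, 'JAN', 267.8), (2002, 'FEB', 242.0)]

# Format each (year, period, value) tuple, then glue the lines together.
lines = ['%d %s: %.1f' % entry for entry in data]
report = '\n'.join(lines)
print(report)
# 1928 JAN: 267.8
# 2002 FEB: 242.0
```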
[QUOTE=jaooe;40398814]-code-
That's amazing and exactly what I need, would you mind breaking down your code if that's not too much to ask?[/QUOTE]
ZeekyHBomb has already explained it, but here's a more verbose way of writing the same code:
[code]In [1]: raw_lines = open('data.txt', 'r').readlines()
In [2]: line_split = [line.strip().split() for line in raw_lines]
In [3]: years = map(int, line_split[1][1::2])
In [4]: months = line_split[0][::2]
In [5]: values = map(float, line_split[1][::2])
In [6]: data = zip(years, months, values)
In [7]: raw_lines
Out[7]:
[' JAN Year FEB Year MAR Year APR Year MAY Year JUN Year JUL Year AUG Year SEP Year OCT Year NOV Year DEC Year WIN Year SPR Year SUM Year AUT Year ANN Year',
' 267.8 1928 242.0 2002 218.3 1981 146.3 1920 162.4 1967 205.4 2012 196.0 1939 260.9 1956 287.1 1918 280.9 1967 306.6 2009 258.5 1993 602.5 1995 455.8 1920 488.9 2012 667.8 1954 1689.3 1954']
In [8]: print '[%s]' % '\n'.join(str(l) for l in line_split)
[['JAN', 'Year', 'FEB', 'Year', 'MAR', 'Year', 'APR', 'Year', 'MAY', 'Year', 'JUN', 'Year', 'JUL', 'Year', 'AUG', 'Year', 'SEP', 'Year', 'OCT', 'Year', 'NOV', 'Year', 'DEC', 'Year', 'WIN', 'Year', 'SPR', 'Year', 'SUM', 'Year', 'AUT', 'Year', 'ANN', 'Year']
['267.8', '1928', '242.0', '2002', '218.3', '1981', '146.3', '1920', '162.4', '1967', '205.4', '2012', '196.0', '1939', '260.9', '1956', '287.1', '1918', '280.9', '1967', '306.6', '2009', '258.5', '1993', '602.5', '1995', '455.8', '1920', '488.9', '2012', '667.8', '1954', '1689.3', '1954']]
In [9]: print str(years)
[1928, 2002, 1981, 1920, 1967, 2012, 1939, 1956, 1918, 1967, 2009, 1993, 1995, 1920, 2012, 1954, 1954]
In [10]: print str(months)
['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC', 'WIN', 'SPR', 'SUM', 'AUT', 'ANN']
In [11]: print str(values)
[267.8, 242.0, 218.3, 146.3, 162.4, 205.4, 196.0, 260.9, 287.1, 280.9, 306.6, 258.5, 602.5, 455.8, 488.9, 667.8, 1689.3]
In [12]: data
Out[12]:
[(1928, 'JAN', 267.8),
(2002, 'FEB', 242.0),
(1981, 'MAR', 218.3),
(1920, 'APR', 146.3),
(1967, 'MAY', 162.4),
(2012, 'JUN', 205.4),
(1939, 'JUL', 196.0),
(1956, 'AUG', 260.9),
(1918, 'SEP', 287.1),
(1967, 'OCT', 280.9),
(2009, 'NOV', 306.6),
(1993, 'DEC', 258.5),
(1995, 'WIN', 602.5),
(1920, 'SPR', 455.8),
(2012, 'SUM', 488.9),
(1954, 'AUT', 667.8),
(1954, 'ANN', 1689.3)][/code]
Yeah, fewer lines isn't always better when it gets that hard to understand.
It does look cool, though.
Thanks Anonim and ZeekyHBomb for those thorough explanations. The data I posted was just a snippet of the actual data; the real file has many more rows of values and years. Is there anything I need to think about bearing that in mind?
And for those of you who are wondering what the data is about, it's rainfall data that's going to be used for a visualization project for uni.
[QUOTE=jaooe;40416779]Thanks Anonim and ZeekyHBomb for those thorough explanations. The data I posted was just a snippet of the actual data; the real file has many more rows of values and years. Is there anything I need to think about bearing that in mind?
And for those of you who are wondering what the data is about, it's rainfall data that's going to be used for a visualization project for uni.[/QUOTE]
Does your parser need to be more general than for reading alternating lines rigidly following some well-defined syntax, or is it fine if it breaks at the slightest inconsistency? Can you write it in such a way that once you've read one row of data, it's trivial to expand it to parse multiple rows? Think these things through before coding up a solution.
That said, if we throw those tips out the window and go for conciseness, assuming the data alternates on every other line like this:
[code][MONTH] Year [MONTH] Year [MONTH] Year
[VALUE] [YEAR] [VALUE] [YEAR] [VALUE] [YEAR]
[MONTH] Year [MONTH] Year [MONTH] Year
[VALUE] [YEAR] [VALUE] [YEAR] [VALUE] [YEAR]
[MONTH] Year [MONTH] Year [MONTH] Year
[VALUE] [YEAR] [VALUE] [YEAR] [VALUE] [YEAR]
...[/code]
then you can still do this in two lines of code in Python:
[code]lsplit = [l.strip().split() for l in open(file_name, 'r').readlines()]
data = [zip(map(int, valueline[1::2]), headline[::2], map(float, valueline[::2])) for (headline, valueline) in zip(lsplit[::2], lsplit[1::2])][/code]
But then again, conciseness isn't really important. You might want to go for robustness and readability instead -- this 2-liner is ugly and breaks if there are empty lines.
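As a sketch of what the more robust and readable route might look like (Python 3): the function below skips empty lines and complains with a line number when a row has the wrong number of entries. `parse_rainfall` is a made-up name, and the rule "a header line starts with a letter, a value line starts with a digit" is an assumption about the file, not something confirmed by the data.

```python
def parse_rainfall(lines):
    """Return (year, period, value) tuples from an iterable of text lines.

    Assumption: a header line starts with a letter, a value line with a digit.
    """
    periods = None
    records = []
    for line_number, line in enumerate(lines, 1):
        tokens = line.split()
        if not tokens:
            continue                      # skip blank lines
        if tokens[0][0].isalpha():
            periods = tokens[::2]         # drop the repeated "Year" labels
            continue
        if periods is None:
            raise ValueError('line %d: values before any header' % line_number)
        if len(tokens) != 2 * len(periods):
            raise ValueError('line %d: expected %d numbers, got %d'
                             % (line_number, 2 * len(periods), len(tokens)))
        values, years = tokens[::2], tokens[1::2]
        for period, value, year in zip(periods, values, years):
            records.append((int(year), period, float(value)))
    return records


# Illustrative input only: the last row is made up, not taken from the real file.
sample = [
    'JAN Year FEB Year',
    '267.8 1928 242.0 2002',
    '',
    '250.1 1948 230.5 1990',
]
for record in parse_rainfall(sample):
    print('%d %s %.1f' % record)
```

Because the header is remembered until the next one appears, this handles both a single header followed by many value rows and the strictly alternating layout. An open file object can be passed in directly, since iterating over a file yields its lines.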
I need to have it in the form:
[code]
year period value
[/code]
Like in your previous example. Then I can parse it myself. I just don't like the way it's presented in the current form I have.
Here's the full data [url]http://pastebin.com/Hg8uvwgB[/url]
Just to be clear though, I need to be able to explain anything I do, so I need to understand what's going on, i.e. please don't just parse the entire data without explaining anything (not to sound ungrateful).
[editline]25th April 2013[/editline]
When I try to follow your code I get an error with
[code]
years = map(int, line_split[1][1::2])
[/code]
[code]
Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
years = map(int, line_split[1][1::2])
TypeError: 'generator' object is not subscriptable
[/code]
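[code]#!/usr/bin/env python3
import itertools

with open('data.txt', 'r') as data_file:
    # First line: period names alternate with the word "Year"; keep the names.
    periods = data_file.readline().split()[::2]
    # Remaining lines: one flat list of alternating value/year tokens.
    raw_data = data_file.read().split()

values = raw_data[0::2]
years = raw_data[1::2]
# cycle() restarts the period names for every row of the file.
data = zip(years, itertools.cycle(periods), values)
print('\n'.join(' '.join(datum) for datum in data))[/code]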
Also works with Python 2.
[QUOTE=jaooe;40418163]When I try to follow your code I get an error with
[code]
years = map(int, line_split[1][1::2])
[/code]
[code]
Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
years = map(int, line_split[1][1::2])
TypeError: 'generator' object is not subscriptable
[/code][/QUOTE]
Accidentally left line_split as a generator instead of a list. I fixed it now.
(f(x) for x in xs) creates an iterable generator. It doesn't actually create a list and it doesn't store it in memory, thus indexing line_split[i] doesn't really make sense.
[f(x) for x in xs] does create a list, or you can turn a generator into a list with list(your_generator).
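A small sketch of the difference, using a made-up two-element input instead of the real file:

```python
words = ['a b', 'c d']

gen = (w.split() for w in words)    # generator: produces items on demand
lst = [w.split() for w in words]    # list: all items stored in memory

print(lst[1])                       # ['c', 'd']

try:
    gen[1]
except TypeError as error:
    print('generator:', error)      # indexing a generator fails

print(list(gen)[1])                 # ['c', 'd'] once converted to a list
```

Note that a generator can only be consumed once, so after `list(gen)` it is empty.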
I feel so out of my depth, are there any documents you would recommend to help me understand all of this?
Maybe you're better off with a Java solution:
[code]import java.io.File;
import java.util.Scanner;

class DataFormatter {
    public static void main(String[] args) throws Exception {
        Scanner file = new Scanner(new File("data.txt"));
        String[] periods = file.nextLine().trim().split("\\s+Year\\s*");
        int periodsIdx = 0;
        file.useDelimiter("\\s+");
        while(file.hasNext())
        {
            String value = file.next();
            String year = file.next();
            System.out.printf("%s %s %s\n", year, periods[periodsIdx], value);
            if(++periodsIdx == periods.length)
            {
                periodsIdx = 0;
            }
        }
    }
}
[/code]